Previously submitted to: Journal of Medical Internet Research (no longer under consideration since Jan 28, 2026)
Date Submitted: Jul 31, 2025
A Comparison of Large Language Models in Support for Different Stakeholders against the Fentanyl Crisis: Performance Evaluation of Multiple Models
ABSTRACT
Background:
The fentanyl crisis is an urgent public health challenge in which knowledge gaps contribute to rising overdose mortality. Large language models (LLMs) have shown great potential in medical fields such as telemedicine and health education, but their benefits for different stakeholders in combating the fentanyl crisis warrant further investigation.
Objective:
This study aims to systematically evaluate differences in the quality of real-time fentanyl-related guidance provided by six LLMs to users, first responders, clinicians, and policymakers; to clarify the strengths and weaknesses of each LLM across four major scenarios (identifying fentanyl, implementing emergency rescue, clinical diagnosis and treatment, and public health decision-making); and to provide an evidence base for building precise, reliable, and multilingual LLM-based fentanyl crisis intervention tools that reduce the risk of overdose deaths caused by knowledge gaps.
Methods:
We compared six LLMs (ChatGPT 3.5, Gemini 1.5 Flash, YouChat Smart, Copilot, Perplexity, and Luzia) on their ability to answer fentanyl-related questions. Two experts scored the models' performance in each scenario, and scores were analyzed using analysis of variance (ANOVA), linear mixed models (LMMs), and Cohen's kappa as a test of inter-rater agreement.
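The analysis described above can be sketched as follows. This is a hypothetical illustration only: all rater scores and per-question means below are invented placeholders, not data from the study, and the LMM step is omitted for brevity. It shows Cohen's kappa for agreement between two expert raters and a one-way ANOVA across the four question types.

```python
# Hypothetical sketch of the scoring analysis: Cohen's kappa for
# inter-rater agreement, then a one-way ANOVA across question types.
# All numbers are invented placeholders, not study data.
from collections import Counter
from scipy.stats import f_oneway

def cohens_kappa(a, b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n      # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in ca) / (n * n)   # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Invented 1-5 ratings by two experts for one model's eight answers.
rater_a = [5, 4, 4, 3, 5, 2, 4, 3]
rater_b = [5, 4, 3, 3, 5, 2, 4, 4]
kappa = cohens_kappa(rater_a, rater_b)

# Invented mean scores per question type for one model.
user_q      = [4.5, 4.0, 4.2]
first_aid_q = [3.0, 3.5, 2.8]
clinical_q  = [3.8, 4.1, 3.6]
policy_q    = [2.9, 3.2, 3.0]
f_stat, p_value = f_oneway(user_q, first_aid_q, clinical_q, policy_q)
print(f"kappa={kappa:.2f}, F={f_stat:.1f}, p={p_value:.4f}")
```

A kappa near or above 0.6 is conventionally read as substantial agreement, and a significant ANOVA would motivate the pairwise model comparisons reported in the Results.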
Results:
LLM performance differed significantly between question types (P<.05 by ANOVA), and the LMM confirmed that ChatGPT outperformed all other models across categories, with the largest effect sizes found when comparing ChatGPT to Gemini 1.5 Flash (formerly Bard) and Copilot (formerly Bing Chat). Individually, Gemini performed well on user-related questions but was relatively weak on first-aid-related questions. Luzia (on WhatsApp) performed moderately on first-aid-related questions but poorly on clinical and policy-making ones. Perplexity scored relatively high on clinical questions, but its overall consistency was poor. YouChat Smart and Copilot scored low in all scenarios and showed poor stability.
Conclusions:
LLMs can provide real-time guidance for users, first responders, clinicians, and policymakers, with performance differing between models across question types. The choice of LLM for answering fentanyl-related questions should therefore be based on the specific scenario.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.