Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: May 24, 2023
Date Accepted: Sep 29, 2023
Large Language Models for Therapy Recommendations across Three Clinical Specialties: A Comparative Study
ABSTRACT
Background:
Large language models (LLMs) have shown potential to generate medical information. However, the quality, safety, and reliability of this AI-generated content need to be assessed.
Objective:
This study aimed to evaluate the performance of four LLMs (Claude-instant-v1.0, GPT-3.5-Turbo, command-xlarge-nightly, and Bloomz) in generating medical information across three specialties: ophthalmology, orthopedics, and dermatology.
Methods:
Three physicians, each focusing on their own field of expertise, assessed the quality of AI-generated therapeutic recommendations for 60 diseases using mDISCERN criteria as well as correctness and harmfulness ratings. ANOVA and pairwise t tests were conducted to analyze differences in the quality and safety of the generated content among the models and specialties. GPT-4 was used to rate all 60 diseases on the same criteria, and its ratings were compared with the physicians' ratings using Pearson correlation analysis.
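The statistical workflow described above can be sketched with SciPy. The ratings below are hypothetical placeholders, not the study's data; the model names mirror those evaluated in the paper, but group sizes and values are illustrative only.

```python
# Sketch of the analysis pipeline: one-way ANOVA across models,
# pairwise t tests, and a Pearson correlation between rater groups.
from itertools import combinations
from scipy import stats

# Hypothetical per-disease mean mDISCERN ratings (scale 1-5) for each model.
ratings = {
    "claude-instant-v1.0": [3.4, 3.2, 3.5, 3.3, 3.4],
    "gpt-3.5-turbo":       [3.0, 2.8, 3.1, 2.9, 3.0],
    "command-xlarge":      [2.5, 2.3, 2.6, 2.4, 2.5],
    "bloomz":              [1.1, 1.0, 1.1, 1.0, 1.1],
}

# One-way ANOVA: do mean ratings differ among the four models?
f_stat, p_anova = stats.f_oneway(*ratings.values())

# Pairwise independent-samples t tests between all model pairs.
pairwise = {
    (a, b): stats.ttest_ind(ratings[a], ratings[b]).pvalue
    for a, b in combinations(ratings, 2)
}

# Pearson correlation between physician and GPT-4 per-model ratings
# (again, hypothetical numbers for illustration).
physician = [3.4, 3.0, 2.5, 1.1]
gpt4      = [3.3, 3.1, 2.4, 1.2]
r, p_corr = stats.pearsonr(physician, gpt4)
```

In practice the pairwise p values would also need a multiple-comparison correction (e.g. Bonferroni), which is omitted here for brevity.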
Results:
Claude-instant-v1.0 achieved the highest mean mDISCERN score (3.35; CI 3.23-3.46), while Bloomz had the lowest (1.07; CI 1.03-1.10). Significant differences were observed among the models (p<.001). For falseness ratings, significant differences were found between models (p<.001) and specialties (p<.001); for harmfulness, significant differences were found only between the models (p<.001). GPT-3.5-Turbo had the lowest harmfulness rating. Pearson correlation analysis demonstrated significant alignment between physician and GPT-4 ratings across all criteria (p<.01).
Conclusions:
The evaluated LLMs showed potential for generating helpful medical information but require further improvement to address concerns about the harmfulness and falseness of generated content. The results underscore the importance of ongoing systematic evaluation and refinement of AI models to ensure reliable and safe medical information generation. Moreover, this study outlines an automatic evaluation method using GPT-4 that can be transferred to other domains and scoring tasks beyond therapy recommendation evaluation.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.