Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: May 24, 2023
Date Accepted: Sep 29, 2023

The final, peer-reviewed published version of this preprint can be found here:

Large Language Models for Therapy Recommendations Across 3 Clinical Specialties: Comparative Study

Wilhelm TI, Roos JJ, Kaczmarczyk R

J Med Internet Res 2023;25:e49324

DOI: 10.2196/49324

PMID: 37902826

PMCID: 10644179

Large Language Models for Therapy Recommendations across Three Clinical Specialties: A Comparative Study

  • Theresa Isabelle Wilhelm; 
  • Jonas Joachim Roos; 
  • Robert Kaczmarczyk

ABSTRACT

Background:

Large language models (LLMs) have shown potential to generate medical information. However, the quality, safety, and reliability of this AI-generated content need to be assessed.

Objective:

This study aimed to evaluate the performance of four LLMs (Claude-instant-v1.0, GPT-3.5-Turbo, command-xlarge-nightly, and Bloomz) in generating medical information across three specialties: ophthalmology, orthopedics, and dermatology.

Methods:

Three physicians assessed the quality of AI-generated therapeutic recommendations for 60 diseases using mDISCERN criteria, correctness, and harmfulness ratings, with each physician focusing on their respective field of expertise. ANOVA and pairwise t tests were conducted to analyze differences in the quality and safety of the generated content among the models and specialties. GPT-4 was used to rate all 60 diseases based on the same criteria, and its ratings were compared with the physicians' ratings using Pearson correlation analysis.
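The model comparison described above can be sketched as follows. This is a minimal illustration assuming scipy is available; the model names match the study, but the rating values here are made up for demonstration and are not the study's data.

```python
from itertools import combinations
from scipy import stats

# Hypothetical per-model mDISCERN ratings (illustrative values only)
ratings = {
    "claude-instant-v1.0": [3.4, 3.2, 3.5, 3.3, 3.1],
    "gpt-3.5-turbo":       [3.0, 2.8, 3.1, 2.9, 3.2],
    "command-xlarge":      [2.5, 2.7, 2.4, 2.6, 2.8],
    "bloomz":              [1.1, 1.0, 1.1, 1.0, 1.1],
}

# One-way ANOVA: do mean ratings differ across the four models?
f_stat, p_anova = stats.f_oneway(*ratings.values())

# Pairwise t tests between every pair of models
pairwise = {
    (a, b): stats.ttest_ind(ratings[a], ratings[b]).pvalue
    for a, b in combinations(ratings, 2)
}
```

In practice, pairwise p values would also need a multiple-comparison correction (e.g., Bonferroni), which this sketch omits.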

Results:

Claude-instant-v1.0 achieved the highest mean mDISCERN score (3.35; CI 3.23-3.46), while Bloomz had the lowest (1.07; CI 1.03-1.10). Significant differences were observed among the models (p<.001). For falseness ratings, significant differences were found between models (p<.001) and specialties (p<.001); for harmfulness, significant differences were found only between models (p<.001). GPT-3.5-Turbo had the lowest harmfulness rating. Pearson correlation analysis demonstrated significant agreement between physician-generated and GPT-4-generated ratings across all criteria (p<.01).
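The physician-versus-GPT-4 agreement check reduces to a Pearson correlation over paired ratings. A minimal sketch, assuming scipy is available; both rating lists below are invented for illustration and do not reproduce the study's data.

```python
from scipy.stats import pearsonr

# Hypothetical paired ratings for the same items (illustrative values only)
physician_ratings = [3.2, 1.1, 2.8, 3.5, 2.0, 1.4, 3.0, 2.6]
gpt4_ratings      = [3.0, 1.2, 2.9, 3.4, 2.2, 1.5, 3.1, 2.4]

# Pearson r close to 1 indicates strong linear agreement between raters
r, p_value = pearsonr(physician_ratings, gpt4_ratings)
```

A high r with a small p value, as the study reports, supports using the LLM rater as a proxy for physician ratings on this criterion.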

Conclusions:

The evaluated LLMs showed potential for generating helpful medical information but require further improvement to address concerns about the harmfulness and falseness of generated content. The results underscore the importance of ongoing systematic evaluation and refinement of AI models to ensure reliable and safe medical information generation. In addition, this study outlined an automatic evaluation method using GPT-4 that can be transferred to other domains and scoring schemes beyond therapy recommendation evaluation.



© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.