Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: May 24, 2023
Date Accepted: Sep 29, 2023

The final, peer-reviewed published version of this preprint can be found here:

Large Language Models for Therapy Recommendations Across 3 Clinical Specialties: Comparative Study

Wilhelm TI, Roos JJ, Kaczmarczyk R

J Med Internet Res 2023;25:e49324

DOI: 10.2196/49324

PMID: 37902826

PMCID: 10644179

Large Language Models for Therapy Recommendations across Three Clinical Specialties: A Comparative Study

  • Theresa Isabelle Wilhelm; 
  • Jonas Joachim Roos; 
  • Robert Kaczmarczyk

ABSTRACT

Background:

Large language models (LLMs) have shown potential to generate medical information. However, the quality, safety, and reliability of this AI-generated content need to be assessed.

Objective:

This study aimed to evaluate the performance of four LLMs (Claude-instant-v1.0, GPT-3.5-Turbo, command-xlarge-nightly, and Bloomz) in generating medical information across three specialties: ophthalmology, orthopedics, and dermatology.

Methods:

Three physicians assessed the quality of AI-generated therapeutic recommendations for 60 diseases using mDISCERN criteria, correctness, and harmfulness ratings, with each physician focusing on their respective field of expertise. ANOVA and pairwise t tests were conducted to analyze differences in the quality and safety of the generated content among the models and specialties. GPT-4 was used to rate all 60 diseases based on the same criteria, and its ratings were compared with the physicians' ratings using Pearson correlation analysis.
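The model comparison described above can be sketched as follows. This is a minimal illustration assuming scipy is available; the model names match the study, but the rating values here are made up for demonstration and are not the study's data.

```python
from itertools import combinations
from scipy import stats

# Hypothetical per-model mDISCERN ratings (illustrative values only)
ratings = {
    "claude-instant-v1.0": [3.4, 3.2, 3.5, 3.3, 3.1],
    "gpt-3.5-turbo":       [3.0, 2.8, 3.1, 2.9, 3.2],
    "command-xlarge":      [2.5, 2.7, 2.4, 2.6, 2.8],
    "bloomz":              [1.1, 1.0, 1.1, 1.0, 1.1],
}

# One-way ANOVA: do mean ratings differ across the four models?
f_stat, p_anova = stats.f_oneway(*ratings.values())

# Pairwise t tests between every pair of models
pairwise = {
    (a, b): stats.ttest_ind(ratings[a], ratings[b]).pvalue
    for a, b in combinations(ratings, 2)
}
```

In practice, pairwise p values would also need a multiple-comparison correction (e.g., Bonferroni), which this sketch omits.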

Results:

Claude-instant-v1.0 achieved the highest mean mDISCERN score (3.35; CI 3.23-3.46), while Bloomz had the lowest (1.07; CI 1.03-1.10). Significant differences were observed among the models (p<.001). For falseness ratings, significant differences were found between models (p<.001) and specialties (p<.001); for harmfulness, significant differences were found only between models (p<.001). GPT-3.5-Turbo had the lowest harmfulness rating. Pearson correlation analysis demonstrated significant agreement between physician-generated and GPT-4-generated ratings across all criteria (p<.01).
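The physician-versus-GPT-4 agreement check reduces to a Pearson correlation over paired ratings. A minimal sketch, assuming scipy is available; both rating lists below are invented for illustration and do not reproduce the study's data.

```python
from scipy.stats import pearsonr

# Hypothetical paired ratings for the same items (illustrative values only)
physician_ratings = [3.2, 1.1, 2.8, 3.5, 2.0, 1.4, 3.0, 2.6]
gpt4_ratings      = [3.0, 1.2, 2.9, 3.4, 2.2, 1.5, 3.1, 2.4]

# Pearson r close to 1 indicates strong linear agreement between raters
r, p_value = pearsonr(physician_ratings, gpt4_ratings)
```

A high r with a small p value, as the study reports, supports using the LLM rater as a proxy for physician ratings on this criterion.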

Conclusions:

The evaluated LLMs showed potential for generating helpful medical information but require further improvement to address concerns about the harmfulness and falseness of generated content. The results underscore the importance of ongoing systematic evaluation and refinement of AI models to ensure reliable and safe medical information generation. In addition, this study outlined an automatic evaluation method using GPT-4 that can be transferred to other domains and scoring schemes beyond therapy recommendation evaluation.



© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.