
Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Oct 17, 2023
Date Accepted: Apr 23, 2024

The final, peer-reviewed published version of this preprint can be found here:

Evaluating the Diagnostic Performance of Large Language Models on Complex Multimodal Medical Cases

Chiu WHK, Ko WSK, Cho WCS, Hui SYJ, Chan WCL, Kuo MD

Evaluating the Diagnostic Performance of Large Language Models on Complex Multimodal Medical Cases

J Med Internet Res 2024;26:e53724

DOI: 10.2196/53724

PMID: 38739441

PMCID: 11130768

Evaluating the Diagnostic Performance of Large Language Models on Complex Multimodal Medical Cases

  • Wan Hang Keith Chiu; 
  • Wei Sum Koel Ko; 
  • William Chi Shing Cho; 
  • Sin Yu Joanne Hui; 
  • Wing Chi Lawrence Chan; 
  • Michael D Kuo

ABSTRACT

Background:

Large language models (LLMs) have demonstrated surprising performance on radiological examinations (1). However, their proficiency in real-world medical reasoning, especially when integrating multimodal data, remains uncertain (2).

Objective:

This study evaluates the ability of 3 commonly used LLMs (BARD, Claude2, and GPT-4) to generate differential diagnoses (ddx) for complex multimodal diagnostic cases.

Methods:

Consecutive “Case Records of the Massachusetts General Hospital” from 07/2020–06/2023 containing clinical, biochemical, and radiological information were selected (3). The cases were diagnostically challenging, and each had a final diagnosis provided. Only the case presentation and a simple prompt asking for the top 5 ddx were used as input. Each case was run independently to prevent the model from being influenced by prior cases. To enable objective assessment, all diagnoses were mapped to their corresponding 10th revision International Classification of Diseases (ICD-10) codes, with higher-level codes used if an exact code could not be assigned (Table 1). The primary objective was accuracy, measured by whether the final diagnosis was within the LLM-generated ddx at the ICD-10 Category level. The secondary objectives were to measure the similarity between the diagnoses within a ddx and the final diagnosis, as well as their similarity to each other, measured at the ICD-10 Chapter level. Chi-square tests and ANOVA were used to compare data between LLMs. Statistical analysis was performed using Prism 10 (GraphPad Software, USA).
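The category-level accuracy check described above can be sketched in a few lines. This is a minimal illustration, not the authors' actual pipeline: the helper names and the example ICD-10 codes are assumptions, and it relies on the fact that the first 3 characters of an ICD-10 code give its category.

```python
def icd10_category(code: str) -> str:
    """Return the 3-character ICD-10 category (e.g. 'C34' from 'C34.1')."""
    return code.split(".")[0][:3]

def case_is_accurate(final_dx: str, ddx: list[str]) -> bool:
    """A case counts as accurate if the final diagnosis's ICD-10 category
    appears among the categories of the LLM's top-5 differential."""
    target = icd10_category(final_dx)
    return any(icd10_category(d) == target for d in ddx)

# Toy example: final diagnosis C34.1, with a hypothetical LLM top-5 ddx.
ddx = ["C34.9", "J18.9", "A15.0", "C78.0", "D14.3"]
print(case_is_accurate("C34.1", ddx))  # True: category C34 is matched
```

Overall accuracy per model would then simply be the fraction of cases for which this check returns true.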

Results:

The diagnostic accuracies on 104 evaluated cases were 27.9%, 30.8%, and 31.7% for BARD, Claude2, and GPT-4, respectively. Accuracy improved significantly at the ICD-10 Chapter (body site or system) level, reaching 65.3%, 66.3%, and 71.1%, respectively. All 3 LLMs showed evidence of interpretive reasoning, as they tended to generate ddx whose member diagnoses were related to each other (median number of ddx per case belonging to the same ICD-10 chapter as each other: 3.0 for all 3 LLMs; SD 1.1, 1.1, and 0.9 for BARD, Claude2, and GPT-4, respectively). Interestingly, these related diagnosis “clusters” were often unrelated to the final diagnosis (median number of ddx belonging to the same ICD-10 chapter as the final diagnosis: 1.0 for all 3 LLMs; SD 1.3, 1.4, and 1.2 for BARD, Claude2, and GPT-4, respectively). Both findings held irrespective of whether the LLMs included the final diagnosis in their ddx. Furthermore, the performance of the LLMs varied by disease etiology, although the differences were not statistically significant (Table 2).
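The two chapter-level statistics reported above can be sketched as follows. This is an illustrative sketch only: it uses the leading letter of each code as a crude chapter proxy (a faithful implementation would map category ranges such as A00–B99 and C00–D49 to chapters I–XXII), and the example codes are hypothetical.

```python
from collections import Counter

def chapter(code: str) -> str:
    """Crude ICD-10 chapter proxy: the leading letter of the code."""
    return code[0]

def largest_same_chapter_cluster(ddx: list[str]) -> int:
    """Size of the biggest group of differential diagnoses sharing one
    ICD-10 chapter (the per-case 'cluster' statistic)."""
    return max(Counter(chapter(d) for d in ddx).values())

def ddx_in_final_chapter(final_dx: str, ddx: list[str]) -> int:
    """Number of differential diagnoses in the same chapter as the
    final diagnosis."""
    return sum(chapter(d) == chapter(final_dx) for d in ddx)

# Hypothetical top-5 ddx: three neoplasm codes (C), one respiratory (J),
# one infectious (A); final diagnosis in the respiratory chapter (J).
ddx = ["C34.9", "C78.0", "C25.0", "J18.9", "A15.0"]
print(largest_same_chapter_cluster(ddx))   # 3: the C-chapter cluster
print(ddx_in_final_chapter("J84.1", ddx))  # 1: only J18.9
```

The medians of these two per-case values across the 104 cases give the 3.0 and 1.0 figures reported in the Results.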

Conclusions:

This study rigorously evaluates the diagnostic capacity of multiple LLMs using a simple standardized prompt (4). The 3 LLMs represent state-of-the-art, general LLMs accessible to most clinicians. The relatively low accuracies of all 3 models at the ICD-10 category level, coupled with a median of 3/5 diagnoses residing in a chapter outside the final diagnosis chapter, suggest either a knowledge or a reasoning gap in current LLMs. Conversely, the moderate number of LLM-generated ddx belonging to the same body site/system (chapter) implies these models can integrate and reason across complex clinical findings. Limitations include not assessing whether human-AI interaction or prompt engineering would affect diagnostic accuracy. Nevertheless, attempts to “overengineer” general LLMs toward a desired output could cloud real-world applicability, detracting from the ease of use that makes current LLMs attractive to general users (5). Future work includes analyzing the rationales provided by the LLMs in reaching their ddx and asking the LLMs to quantify the likelihood of each ddx. Finally, the diversity of LLM-generated ddx warrants further exploration, as it could potentially hamper patient management (6). In conclusion, LLMs may have a role in enhancing physician diagnosis of complex, multimodal clinical cases when applied judiciously. Clinical Trial: N/A




© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.