Accepted for/Published in: JMIR Formative Research
Date Submitted: Jul 28, 2024
Open Peer Review Period: Aug 1, 2024 - Sep 26, 2024
Date Accepted: Oct 1, 2024
Comparative Analysis of Diagnostic Performance: Differential Diagnosis Lists by LLaMA3 Versus LLaMA2 for Case Reports
ABSTRACT
Background:
Generative artificial intelligence (AI), particularly in the form of large language models (LLMs), has developed rapidly. The LLM by Meta AI (LLaMA) series is popular and was recently updated from LLaMA2 to LLaMA3. However, the impact of this update on diagnostic performance has not been well documented.
Objective:
We conducted a comparative evaluation of the diagnostic performance in differential diagnosis lists generated by LLaMA3 and LLaMA2 for case reports.
Methods:
We analyzed case reports published in the American Journal of Case Reports from 2022 to 2023. After excluding nondiagnostic and pediatric cases, we input the remaining cases into LLaMA3 and LLaMA2 using the same prompt and the same adjustable parameters. Diagnostic performance was defined by whether the differential diagnosis list included the final diagnosis. Multiple physicians independently evaluated whether the final diagnosis was included in the top 10 differentials generated by LLaMA3 and LLaMA2.
Results:
In our comparative evaluation of the diagnostic performance between LLaMA3 and LLaMA2, we analyzed differential diagnosis lists for 392 case reports. The final diagnosis was included in the top 10 differentials generated by LLaMA3 in 79.6% (312/392) of the cases, compared to 49.7% (195/392) for LLaMA2, indicating a statistically significant improvement (P <.001). Additionally, LLaMA3 showed higher performance in including the final diagnosis in the top 5 differentials, observed in 63.0% (247/392) of cases, compared to LLaMA2’s 38.0% (149/392, P <.001). Furthermore, the top diagnosis was accurately identified by LLaMA3 in 33.9% (133/392) of cases, significantly higher than the 22.7% (89/392) achieved by LLaMA2 (P <.001). The analysis across various medical specialties revealed variations in diagnostic performance with LLaMA3 consistently outperforming LLaMA2.
Conclusions:
The results reveal that the LLaMA3 model significantly outperforms LLaMA2 in diagnostic performance, with a higher percentage of case reports having the final diagnosis listed within the top 10 differentials, within the top 5 differentials, and as the top diagnosis. Overall diagnostic performance improved almost 1.5-fold from LLaMA2 to LLaMA3. These findings support the rapid development and continuous refinement of generative AI systems to enhance diagnostic processes in medicine. However, these findings should be interpreted carefully with regard to clinical application, as generative AI, including the LLaMA series, has not been approved for medical applications such as AI-enhanced diagnostics.
Clinical Trial: Not applicable
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.