Accepted for/Published in: JMIR Formative Research
Date Submitted: Jul 28, 2024
Open Peer Review Period: Aug 1, 2024 - Sep 26, 2024
Date Accepted: Oct 1, 2024
Comparative Analysis of Diagnostic Performance: Differential Diagnosis Lists by LLaMA3 Versus LLaMA2 for Case Reports
ABSTRACT
Background:
Generative artificial intelligence (AI), particularly in the form of large language models (LLMs), has developed rapidly. The LLM by Meta AI (LLaMA) series is popular and was recently updated from LLaMA2 to LLaMA3. However, the impact of this update on diagnostic performance has not been well documented.
Objective:
We conducted a comparative evaluation of the diagnostic performance in differential diagnosis lists generated by LLaMA3 and LLaMA2 for case reports.
Methods:
We analyzed case reports published in the American Journal of Case Reports from 2022 to 2023. After excluding nondiagnostic and pediatric cases, we input the remaining cases into LLaMA3 and LLaMA2 using the same prompt and the same adjustable parameters. Diagnostic performance was defined by whether the differential diagnosis list included the final diagnosis. Multiple physicians independently evaluated whether the final diagnosis was included in the top 10 differentials generated by LLaMA3 and LLaMA2.
Results:
In our comparative evaluation of the diagnostic performance between LLaMA3 and LLaMA2, we analyzed differential diagnosis lists for 392 case reports. The final diagnosis was included in the top 10 differentials generated by LLaMA3 in 79.6% (312/392) of the cases, compared to 49.7% (195/392) for LLaMA2, indicating a statistically significant improvement (P <.001). Additionally, LLaMA3 showed higher performance in including the final diagnosis in the top 5 differentials, observed in 63.0% (247/392) of cases, compared to LLaMA2’s 38.0% (149/392, P <.001). Furthermore, the top diagnosis was accurately identified by LLaMA3 in 33.9% (133/392) of cases, significantly higher than the 22.7% (89/392) achieved by LLaMA2 (P <.001). The analysis across various medical specialties revealed variations in diagnostic performance with LLaMA3 consistently outperforming LLaMA2.
Conclusions:
The results reveal that the LLaMA3 model significantly outperforms LLaMA2 in diagnostic performance, with a higher percentage of case reports having the final diagnosis listed within the top 10 differentials, within the top 5 differentials, and as the top diagnosis. Overall diagnostic performance improved almost 1.5-fold from LLaMA2 to LLaMA3. These findings support the rapid development and continuous refinement of generative AI systems to enhance diagnostic processes in medicine. However, these findings should be interpreted carefully with regard to clinical application, as generative AI, including the LLaMA series, has not been approved for medical applications such as AI-enhanced diagnostics.
Clinical Trial: Not applicable
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.