Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Dec 11, 2024
Date Accepted: May 4, 2025
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Cross-Linguistic Diagnostic Performance of Large Language Models: ChatGPT-4o and 4 Open-Source Models on China's Rare Disease Catalog
ABSTRACT
Background:
Diagnosing rare diseases remains challenging due to their inherent complexity and limited medical knowledge. The emergence of Large Language Models (LLMs) has introduced new opportunities to assist in the diagnostic process.
Objective:
This study compares the diagnostic accuracy of ChatGPT-4o with that of four open-source LLMs (qwen2.5:7b, Llama3.1:8b, qwen2.5:72b, Llama3.1:70b) for rare diseases, assesses the effect of language on their diagnostic performance, and investigates the potential of Retrieval-Augmented Generation (RAG) to improve these models.
Methods:
Clinical manifestations of 121 rare diseases from China’s first rare disease catalog (2018) were collected from public websites. ChatGPT-4o was tasked with providing a primary diagnosis and a list of five differential diagnoses based on these clinical presentations. Subsequently, 20 cases were randomly selected to assess the diagnostic accuracy of the four open-source LLMs in both English and Chinese. The least accurate open-source model was then re-evaluated with RAG. Diagnostic accuracy was compared across models and conditions using chi-square tests. Additionally, a survey questionnaire was administered to 11 clinical practitioners across various specialties to assess their knowledge of the rare disease catalog.
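The chi-square comparison described above can be sketched as follows. This is a minimal illustration, not the authors' analysis code: it assumes an uncorrected Pearson chi-square test on a 2x2 table of correct versus incorrect diagnoses, and the function name `chi_square_2x2` is hypothetical. The example counts (16 of 20 versus 10 of 20) are assumptions chosen to correspond to accuracies of 80% and 50% on a 20-case subset.

```python
import math


def chi_square_2x2(correct_a, total_a, correct_b, total_b):
    """Pearson chi-square test (no continuity correction) for two accuracies.

    Returns the chi-square statistic and the two-sided p-value (1 degree
    of freedom). Assumes every expected cell count is greater than zero.
    """
    # Rows: model A, model B. Columns: correct, incorrect.
    table = [
        [correct_a, total_a - correct_a],
        [correct_b, total_b - correct_b],
    ]
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    n = sum(row_totals)

    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected

    # For 1 degree of freedom, the chi-square survival function
    # reduces to erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Assumed counts: 16/20 correct (80%) versus 10/20 correct (50%).
chi2, p = chi_square_2x2(16, 20, 10, 20)
print(f"chi2 = {chi2:.3f}, p = {p:.3f}")
```

Whether a continuity correction or an exact test was applied in the study is not stated in the abstract, so the choice of the uncorrected test here is an assumption.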
Results:
ChatGPT-4o achieved the highest diagnostic accuracy, at 90.1%. A total of 36 chi-square tests were conducted across the nine diagnostic datasets. In English, ChatGPT-4o significantly outperformed the least accurate open-source model, qwen2.5:7b (90.1% vs. 50%, RR=1.80, P<0.001). Compared with qwen2.5:7b, Llama3.1:8b was not more accurate in Chinese (40%; RR=0.80, P=0.525) but was significantly more accurate in English (80%; RR=1.60, P=0.047), approaching the accuracy of ChatGPT-4o (RR=0.89, P=0.188). Neither of the larger-parameter models, Llama3.1:70b and qwen2.5:72b, differed significantly from ChatGPT-4o in diagnostic accuracy; this held even for the least accurate of these, qwen2.5:72b (RR=0.83, P=0.055). The remaining comparisons likewise indicated that the diagnostic performance of open-source models for rare diseases varies with language, parameter count, and provider. Applying RAG significantly improved the diagnostic accuracy of qwen2.5:7b to 70%, with a retrieval accuracy of 85%. Survey results revealed that clinical practitioners across various specialties generally lacked sufficient understanding of rare diseases.
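As a rough check on the arithmetic behind these figures, the sketch below computes the number of pairwise comparisons among nine datasets and the accuracy ratios. It assumes that RR denotes the ratio of two diagnostic accuracies, which is an interpretation rather than something the abstract states, and the function name `relative_risk` is hypothetical. The dataset labels are placeholders, since the abstract does not enumerate the nine datasets.

```python
from itertools import combinations


def relative_risk(accuracy_a, accuracy_b):
    """Ratio of two diagnostic accuracies (assumed meaning of RR)."""
    return accuracy_a / accuracy_b


# Placeholder labels: the abstract does not list the nine datasets.
datasets = [f"dataset_{i}" for i in range(1, 10)]

# Every unordered pair of datasets gets one chi-square test.
pairs = list(combinations(datasets, 2))
print(len(pairs))

# Accuracies as reported in the abstract, expressed as proportions.
print(round(relative_risk(0.901, 0.50), 2))  # ChatGPT-4o vs qwen2.5:7b
print(round(relative_risk(0.80, 0.50), 2))   # Llama3.1:8b vs qwen2.5:7b, English
print(round(relative_risk(0.40, 0.50), 2))   # Llama3.1:8b vs qwen2.5:7b, Chinese
print(round(relative_risk(0.80, 0.901), 2))  # Llama3.1:8b (English) vs ChatGPT-4o
```

Under this reading, nine datasets give 36 unordered pairs, and the ratios round to the RR values reported above.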
Conclusions:
ChatGPT-4o demonstrated high diagnostic accuracy for rare diseases. Llama3.1:8b proved effective in English, whereas for Chinese, models with 70 billion or more parameters may be required. The parameter size of an LLM, the user's language, and the source of the pre-training data are all critical factors to consider. Moreover, integrating RAG can notably improve the diagnostic accuracy of open-source LLMs for rare diseases.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.