Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Dec 11, 2024
Date Accepted: May 4, 2025

The final, peer-reviewed published version of this preprint can be found here:

Zhong W, Liu Y, Liu Y, Yang K, Gao H, Yan H, Hao W, Yan Y, Yin C

Performance of ChatGPT-4o and Four Open-Source Large Language Models in Generating Diagnoses Based on China’s Rare Disease Catalog: Comparative Study

J Med Internet Res 2025;27:e69929

DOI: 10.2196/69929

PMID: 40532199

PMCID: 12192912

Diagnostic Performance of ChatGPT-4o and Four Open-Source Large Language Models on China's Rare Disease Catalog: Comparative Study

  • Wei Zhong
  • YiFan Liu
  • Yan Liu
  • Kai Yang
  • HuiMin Gao
  • HuiHui Yan
  • WenJing Hao
  • YouSheng Yan
  • ChengHong Yin

ABSTRACT

Background:

Diagnosing rare diseases remains challenging because of their inherent complexity and the limited medical knowledge available about them. The emergence of large language models (LLMs) has introduced new opportunities to assist in the diagnostic process.

Objective:

This study compares the diagnostic accuracy of ChatGPT-4o for rare diseases with that of four open-source LLMs (qwen2.5:7b, Llama3.1:8b, qwen2.5:72b, Llama3.1:70b), assesses the effect of language on their diagnostic performance, and investigates the potential of retrieval-augmented generation (RAG) with these models.

Methods:

Clinical manifestations of 121 rare diseases from China’s first rare disease catalog (2018) were collected from public websites. ChatGPT-4o was asked to provide a primary diagnosis and a list of five differential diagnoses based on these clinical presentations. Subsequently, 20 cases were randomly chosen to assess the diagnostic accuracy of the four open-source LLMs in both English and Chinese. The least accurate open-source model was then re-evaluated with RAG. Diagnostic accuracy was compared across models and conditions using chi-square tests. Additionally, a survey questionnaire was administered to 11 clinical practitioners across various specialties to assess their knowledge of the rare disease catalog.
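As an illustration of the kind of comparison described above, the following minimal sketch computes a relative risk (RR) and a chi-square test for two accuracy counts. It is not the authors' analysis code. The function name `compare_accuracy` is hypothetical, the counts 16/20 and 10/20 are inferred from accuracies of 80% and 50% on 20 cases, and the use of the Pearson statistic without a continuity correction is an assumption, since the abstract does not say which variant was used.

```python
import math

def compare_accuracy(correct_a, n_a, correct_b, n_b):
    """Return (relative risk, chi-square statistic, p-value) for two accuracy counts."""
    # 2x2 table: rows are models, columns are correct / incorrect diagnoses.
    a, b = correct_a, n_a - correct_a
    c, d = correct_b, n_b - correct_b
    n = a + b + c + d

    # Relative risk: ratio of the two accuracy proportions.
    rr = (a / n_a) / (c / n_b)

    # Pearson chi-square for a 2x2 table, no continuity correction (assumed).
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

    # With 1 degree of freedom, the upper-tail p-value equals erfc(sqrt(chi2 / 2)).
    p = math.erfc(math.sqrt(chi2 / 2))
    return rr, chi2, p

# Illustrative counts (assumed): 16/20 vs 10/20 correct, i.e. 80% vs 50%.
rr, chi2, p = compare_accuracy(16, 20, 10, 20)
print(f"RR={rr:.2f}, chi2={chi2:.3f}, P={p:.3f}")
```

With these assumed counts the sketch gives an RR of 1.60 and a P value of about 0.047. The closed-form p-value holds only for one degree of freedom, which is the case for any 2x2 table.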

Results:

ChatGPT-4o achieved the highest diagnostic accuracy, at 90.1%. A series of 36 chi-square tests was conducted on the nine diagnostic datasets. In English, ChatGPT-4o significantly outperformed the least accurate open-source model, qwen2.5:7b (90.1% vs 50%, RR=1.80, P<0.001). Compared with qwen2.5:7b, Llama3.1:8b did not differ significantly in Chinese (accuracy 40%; RR=0.80, P=0.525) but was more accurate in English (accuracy 80%; RR=1.60, P=0.047), approaching ChatGPT-4o (RR=0.89, P=0.188). Both larger-parameter models, Llama3.1:70b and qwen2.5:72b, showed no significant difference in diagnostic accuracy compared with ChatGPT-4o, even for the model with the lowest accuracy, qwen2.5:72b (RR=0.83, P=0.055). The remaining results also indicated that the diagnostic capability of open-source models for rare diseases varies with language, parameter count, and provider. Applying RAG significantly raised the diagnostic accuracy of qwen2.5:7b to 70%, with a retrieval accuracy of 85%. Survey results revealed that clinical practitioners across various specialties generally lack sufficient understanding of rare diseases.
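The figure of 36 tests matches the number of unordered pairs that can be formed from nine datasets. The short sketch below makes that count explicit. The dataset labels are an assumption made for illustration (one ChatGPT-4o run plus four open-source models in two languages); the abstract does not list the nine datasets.

```python
from itertools import combinations

# Assumed composition of the nine datasets (not stated in the abstract).
open_source = ["qwen2.5:7b", "Llama3.1:8b", "qwen2.5:72b", "Llama3.1:70b"]
datasets = ["ChatGPT-4o"] + [
    f"{model} ({language})"
    for model in open_source
    for language in ("English", "Chinese")
]

# Every unordered pair of datasets corresponds to one pairwise comparison.
pairs = list(combinations(datasets, 2))
print(len(datasets), len(pairs))
```

Whatever the actual composition, nine datasets always yield 9 × 8 / 2 = 36 pairwise comparisons.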

Conclusions:

ChatGPT-4o demonstrated high diagnostic accuracy for rare diseases. Llama3.1:8b proved effective in English, but for Chinese, models with at least 70 billion parameters may be required. The parameter size of an LLM, the user's language, and the source of the pre-training data are all critical factors to consider. Moreover, integrating RAG can notably improve the diagnostic accuracy of open-source LLMs for rare diseases.


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.