
Accepted for/Published in: JMIR Human Factors

Date Submitted: May 23, 2025
Open Peer Review Period: May 23, 2025 - Jul 18, 2025
Date Accepted: Nov 17, 2025

The final, peer-reviewed published version of this preprint can be found here:

Diao Y, Wu M, Xu J, Pan Y

Multiple Large Language Models’ Performance on the Chinese Medical Licensing Examination: Quantitative Comparative Study

JMIR Hum Factors 2025;12:e77978

DOI: 10.2196/77978

PMID: 41401211

PMCID: 12707437

A Comparative Study of Multiple Large Language Models in the Chinese National Medical Licensing Examination: A Quantitative Study

  • Yanyu Diao; 
  • Mengyuan Wu; 
  • Jingwen Xu; 
  • Yifeng Pan

ABSTRACT

Background:

Chat Generative Pre-trained Transformer (ChatGPT), a 175-billion-parameter natural language processing model, excels at natural language tasks, but its performance in Chinese, and especially in Chinese medical education, remains underexplored. Meanwhile, models trained on Chinese corpora, such as ERNIE Bot, Tongyi Qianwen, Doubao, and DeepSeek, have emerged: ERNIE Bot offers enhanced knowledge and dialogue capabilities, Tongyi Qianwen supports multilingual tasks, Doubao provides diverse assistance functions, and DeepSeek improves language interaction. Their applications and performance on the Chinese National Medical Licensing Examination remain to be investigated.

Objective:

To quantitatively compare the performance of six large language models (GPT-3.5, GPT-4, ERNIE Bot, Tongyi Qianwen, Doubao, and DeepSeek) on questions from the Chinese National Medical Licensing Examination (NMLE) and to assess their feasibility in Chinese medical education through user-interpretable responses.

Methods:

This study employed the default GPT-3.5-based ChatGPT model, the GPT-4 model available to ChatGPT Plus users, ERNIE Bot, Tongyi Qianwen, Doubao, and DeepSeek. To assess the performance of these six models on the Chinese National Medical Licensing Examination (NMLE) from 2018 to 2024, we selected questions from the four content units of the NMLE's General Written Examination. We systematically input these questions into each model and collected the generated responses. A structured evaluation was then conducted to analyze the accuracy, comprehensiveness, and logical coherence of the models' answers, quantitatively comparing their performance in this specialized medical assessment context.
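The scoring step of this workflow can be sketched in code. The snippet below is a minimal illustration, not the authors' actual pipeline: `responses` stands in for answers collected from a model, `answer_key` for the official NMLE key, and the question IDs are hypothetical.

```python
def grade(responses: dict[str, str], answer_key: dict[str, str]) -> float:
    """Return the fraction of exam questions a model answered correctly."""
    correct = sum(
        1 for qid, ans in responses.items()
        if answer_key.get(qid) == ans
    )
    return correct / len(answer_key)

# Mock answer key and mock model output for four multiple-choice questions.
answer_key = {"q1": "A", "q2": "C", "q3": "B", "q4": "D"}
responses  = {"q1": "A", "q2": "C", "q3": "D", "q4": "D"}

print(f"accuracy = {grade(responses, answer_key):.2%}")  # 3 of 4 correct
```

Repeating this per model and per content unit yields the per-unit accuracies compared in the Results; the comprehensiveness and logical-coherence ratings described above were assessed separately by human review.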

Results:

GPT-4 outperformed GPT-3.5 across all exam units, achieving average accuracies of 66.57% to 80.67%. The Chinese-developed models also performed strongly: DeepSeek consistently scored highest among them (427-473 points), and all models exceeded the passing threshold of 360 points.

Conclusions:

GPT-4 and Chinese-developed LLMs such as DeepSeek show potential as supplementary tools in Chinese medical education, though they require further optimization for complex reasoning and real-world application, with human expertise remaining central.



© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.