Accepted for/Published in: JMIR Human Factors
Date Submitted: May 23, 2025
Open Peer Review Period: May 23, 2025 - Jul 18, 2025
Date Accepted: Nov 17, 2025
A Comparative Study of Multiple Large Language Models in the Chinese National Medical Licensing Examination: A Quantitative Study
ABSTRACT
Background:
Chat Generative Pre-trained Transformer (ChatGPT), a 175-billion-parameter natural language processing model, excels at natural language tasks, but its performance in Chinese, and in Chinese medical education in particular, remains underexplored. Meanwhile, models trained on Chinese corpora, such as ERNIE Bot, Tongyi Qianwen, Doubao, and DeepSeek, have emerged: ERNIE Bot offers enhanced knowledge and dialogue capabilities, Tongyi Qianwen supports multilingual tasks, Doubao provides diverse assistance functions, and DeepSeek improves language interaction. Their applications and performance in the Chinese National Medical Licensing Examination remain to be investigated.
Objective:
To quantitatively compare the performance of multiple large language models (GPT-3.5, GPT-4, ERNIE Bot, Tongyi Qianwen, Doubao, and DeepSeek) in answering questions from the Chinese National Medical Licensing Examination (NMLE) and to analyze their feasibility in Chinese medical education through user-interpretable responses.
Methods:
This study employed the default GPT-3.5-based ChatGPT model, the GPT-4 model available to ChatGPT Plus users, ERNIE Bot, Tongyi Qianwen, Doubao, and DeepSeek. To assess the performance of these six models on the Chinese National Medical Licensing Examination (NMLE) from 2018 to 2024, we selected questions from the four content units of the NMLE's General Written Examination. We input each question into every model and collected the generated responses, then conducted a structured evaluation of the accuracy, comprehensiveness, and logical coherence of the models' answers to quantitatively compare their performance in this specialized medical assessment context.
Results:
GPT-4 outperformed GPT-3.5 across all exam units, achieving average accuracies of 66.57%-80.67%. The Chinese-developed models, notably DeepSeek and ERNIE Bot, also performed strongly: DeepSeek consistently scored highest among them (427-473 points), and all models exceeded the passing threshold (360 points).
Conclusions:
GPT-4 and Chinese-developed LLMs such as DeepSeek show potential as supplementary tools in Chinese medical education, though they require further optimization for complex reasoning and real-world application, and human expertise must remain central.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.