Accepted for/Published in: JMIR Human Factors
Date Submitted: May 23, 2025
Open Peer Review Period: May 23, 2025 - Jul 18, 2025
Date Accepted: Nov 17, 2025
A Comparative Study of Multiple Large Language Models in the Chinese National Medical Licensing Examination: A Quantitative Study
ABSTRACT
Background:
Chat Generative Pre-trained Transformer (ChatGPT), a 175-billion-parameter natural language processing model, excels at natural language tasks, but its performance in Chinese, and in Chinese medical education in particular, remains underexplored. Meanwhile, models trained on Chinese corpora, such as ERNIE Bot, Tongyi Qianwen, Doubao, and DeepSeek, have emerged: ERNIE Bot offers enhanced knowledge and dialogue capabilities, Tongyi Qianwen supports multilingual tasks, Doubao provides diverse assistance functions, and DeepSeek improves language interaction. Their applications and performance in the Chinese National Medical Licensing Examination remain to be investigated.
Objective:
To quantitatively compare the performance of multiple large language models (GPT-3.5, GPT-4, ERNIE Bot, Tongyi Qianwen, Doubao, and DeepSeek) in answering questions from the Chinese National Medical Licensing Examination (NMLE) and to analyze their feasibility in Chinese medical education through user-interpretable responses.
Methods:
This study employed the default GPT-3.5-based ChatGPT model, the GPT-4 model available to ChatGPT Plus users, ERNIE Bot, Tongyi Qianwen, Doubao, and DeepSeek. To assess the performance of these six models on the Chinese National Medical Licensing Examination (NMLE) from 2018 to 2024, we selected questions from the four content units of the NMLE's General Written Examination. We input each question into every model and collected the generated responses, then conducted a structured evaluation of the accuracy, comprehensiveness, and logical coherence of the models' answers to quantitatively compare their performance in this specialized medical assessment context.
Results:
GPT-4 outperformed GPT-3.5 across all exam units, achieving average accuracies of 66.57%-80.67%. The Chinese-developed models, notably DeepSeek and ERNIE Bot, also performed strongly: DeepSeek consistently scored highest among them (427-473 points), and all models exceeded the passing threshold (360 points).
Conclusions:
GPT-4 and Chinese-developed LLMs such as DeepSeek show potential as supplementary tools in Chinese medical education, though they require further optimization for complex reasoning and real-world application, and human expertise must remain central.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.