Accepted for/Published in: JMIR Medical Informatics
Date Submitted: May 29, 2025
Date Accepted: Oct 13, 2025
Performance of the Large Language Models on the Chinese National Nurse Licensure Examination: A Cross-Sectional Evaluation Study
ABSTRACT
Background:
Large language models are increasingly explored in nursing education, but their capabilities on specialized, high-stakes, culturally specific examinations such as the Chinese National Nurse Licensure Examination remain under-evaluated. Accurate assessment is crucial before their adoption in nursing training and practice.
Objective:
To evaluate the performance, accuracy, repeatability, confidence, and robustness of four large language models on the Chinese National Nurse Licensure Examination.
Methods:
Four large language models (Sider Fusion, GPT-4o, Gemini 2.0 Pro, DeepSeek V3) were tested on 237 multiple-choice questions from the 2024 Chinese National Nurse Licensure Examination. Accuracy and repeatability were assessed using two prompting strategies. Confidence was evaluated via self-ratings (1-10 scale), and robustness via repeated adversarial prompting.
Results:
DeepSeek V3 and Gemini 2.0 Pro demonstrated significantly higher overall accuracy (> 83%) than GPT-4o and Sider Fusion (< 71%). However, all large language models showed suboptimal repeatability (< 87% consistency). Critically, confidence calibration was poor: most models expressed high confidence that often mismatched their actual accuracy (P < 0.05). A stability-flexibility trade-off paradox was also observed.
Conclusions:
While some large language models show promising accuracy on the Chinese National Nurse Licensure Examination, fundamental reliability limitations (poor confidence calibration, inconsistent repeatability) hinder safe application in nursing education and practice. Prioritizing trustworthiness and calibrated reliability over surface accuracy is essential for future large language model development.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.