Accepted for/Published in: JMIR Medical Informatics
Date Submitted: May 29, 2025
Date Accepted: Oct 13, 2025
Performance of the Large Language Models on the Chinese National Nurse Licensure Examination: A Cross-Sectional Evaluation Study
ABSTRACT
Background:
Large language models are increasingly explored in nursing education, but their capabilities on specialized, high-stakes, culturally specific examinations such as the Chinese National Nurse Licensure Examination remain under-evaluated. Accurate assessment is crucial before their adoption in nursing training and practice.
Objective:
To evaluate the performance, accuracy, repeatability, confidence, and robustness of four large language models on the Chinese National Nurse Licensure Examination.
Methods:
Four large language models (Sider Fusion, GPT-4o, Gemini 2.0 Pro, DeepSeek V3) were tested on 237 multiple-choice questions from the 2024 Chinese National Nurse Licensure Examination. Accuracy and repeatability were assessed using two prompting strategies. Confidence was evaluated via self-ratings (1-10 scale), and robustness via repeated adversarial prompting.
Results:
DeepSeek V3 and Gemini 2.0 Pro demonstrated significantly higher overall accuracy (> 83%) than GPT-4o and Sider Fusion (< 71%). However, all large language models showed suboptimal repeatability (< 87% consistency). Critically, confidence calibration was poor: most models expressed high confidence that often mismatched their actual accuracy (P < 0.05). A stability-flexibility trade-off paradox was also observed.
Conclusions:
While some large language models show promising accuracy on the Chinese National Nurse Licensure Examination, fundamental reliability limitations (poor confidence calibration, inconsistent repeatability) hinder safe application in nursing education and practice. Prioritizing trustworthiness and calibrated reliability over surface accuracy is essential for future large language model development.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.