Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Jun 27, 2024
Open Peer Review Period: Jun 28, 2024 - Aug 23, 2024
Date Accepted: Dec 20, 2024
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Large Language Models Outperform in the Chinese National Nursing Licensing Examination: A Retrospective Cross-Sectional Study
ABSTRACT
Background:
While large language models (LLMs) often produce impressive outputs, their performance in exams requiring strong reasoning skills and expert domain knowledge, such as the Chinese National Nursing Licensing Examination, remains uncertain.
Objective:
We aimed to assess the performance and educational value of justifications provided by large language models, including GPT-3.5, GPT-4.0, GPT-4o, Copilot, ERNIE Bot-3.5, SPARK, and Qwen-2.5, on the Chinese National Nursing Licensing Examination. Additionally, we explored the feasibility of enhancing their performance by combining these models using machine learning techniques.
Methods:
This retrospective cross-sectional study analyzed multiple-choice questions (MCQs) from the Chinese National Nursing Licensing Examination, each administered twice to GPT-3.5, GPT-4.0, GPT-4o, Copilot, ERNIE Bot-3.5, SPARK, and Qwen-2.5 between May 27 and June 27, 2024. The study also investigated whether machine learning techniques could rapidly improve the aggregate performance of these LLMs.
Results:
Qwen-2.5 achieved the highest accuracy (88.92%) and the lowest variance (0.099). Pairwise comparisons revealed varying degrees of statistical significance, notably between GPT-4.0 and GPT-4o (t = 3.27, P = .001) and between GPT-4.0 and Qwen-2.5 (t = 2.31, P = .021). Qwen-2.5 exhibited the strongest correlation with correct answers (r = 0.86), whereas GPT-3.5 showed the weakest (r = 0.40). Integrating the results of the seven LLMs with machine learning and ensemble methods identified the Random Forest (RF) model as the best performer, achieving an AUC of 0.98, sensitivity of 0.88, specificity of 0.92, F1 score of 0.88, accuracy of 0.88, positive predictive value (PPV) of 0.88, and negative predictive value (NPV) of 1.00.
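The ensemble step described above can be illustrated with a minimal sketch. This is an assumed workflow, not the authors' exact pipeline: each question is encoded by the seven LLMs' binary correctness indicators, and a Random Forest classifier (scikit-learn) learns to predict whether the pooled answer is correct. The data here are synthetic, and the toy label (a simple majority vote) is a stand-in for the true answer key.

```python
# Hedged sketch: ensembling seven LLMs' per-question correctness with a
# Random Forest, on synthetic data (not the study's actual results).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, f1_score

rng = np.random.default_rng(0)
n_questions, n_models = 500, 7  # assumed sizes for illustration

# Synthetic features: 1 = that LLM answered the question correctly.
X = rng.integers(0, 2, size=(n_questions, n_models))
# Toy label: majority vote of the seven models (illustration only).
y = (X.sum(axis=1) >= 4).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_train, y_train)

proba = rf.predict_proba(X_test)[:, 1]  # probability the answer is correct
pred = rf.predict(X_test)
print(f"AUC: {roc_auc_score(y_test, proba):.2f}")
print(f"F1:  {f1_score(y_test, pred):.2f}")
```

Because the toy label is a deterministic function of the seven indicators, the forest recovers it almost exactly; with real answer keys, the learned weighting of models would drive the reported gains.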
Conclusions:
Qwen-2.5 and GPT-4o emerged as the leading performers among the LLMs, with Qwen-2.5 excelling in the Chinese National Nursing Licensing Examination. Moreover, combining various LLMs through machine learning markedly enhanced accuracy, suggesting a promising direction for future applications. Clinical Trial: NA
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.