
Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Jun 27, 2024
Open Peer Review Period: Jun 28, 2024 - Aug 23, 2024
Date Accepted: Dec 20, 2024

The final, peer-reviewed published version of this preprint can be found here:

Qwen-2.5 Outperforms Other Large Language Models in the Chinese National Nursing Licensing Examination: Retrospective Cross-Sectional Comparative Study


Warning: This is an author submission that has not been peer reviewed or edited. Preprints, unless they show as "accepted," should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Large Language Models Outperform in the Chinese National Nursing Licensing Examination: A Retrospective Cross-Sectional Study

  • Shiben Zhu
  • Wanqin Hu
  • Zhi Yang
  • Jiani Yan
  • Fang Zhang

ABSTRACT

Background:

While large language models (LLMs) often produce impressive outputs, their performance in exams requiring strong reasoning skills and expert domain knowledge, such as the Chinese National Nursing Licensing Examination, remains uncertain.

Objective:

We aimed to assess the performance of seven LLMs (GPT-3.5, GPT-4.0, GPT-4o, Copilot, ERNIE Bot-3.5, SPARK, and Qwen-2.5) on the Chinese National Nursing Licensing Examination, along with the educational value of the justifications they provided. We also explored the feasibility of improving performance by combining these models using machine learning techniques.

Methods:

In this retrospective cross-sectional study, seven LLMs (GPT-3.5, GPT-4.0, GPT-4o, Copilot, ERNIE Bot-3.5, SPARK, and Qwen-2.5) each answered the multiple-choice questions (MCQs) of the Chinese National Nursing Licensing Examination twice between May 27 and June 27, 2024. The study also investigated whether machine learning techniques could quickly improve the combined performance of these LLMs.
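
To make the grading step concrete, here is a minimal, hypothetical scoring sketch in Python (pandas); the data layout, names, and the toy example are assumptions, as the abstract does not describe the authors' code:

    # Hypothetical scoring sketch: compare each model's MCQ answer letters
    # against the official answer key and report per-model accuracy for one run.
    import pandas as pd

    MODELS = ["GPT-3.5", "GPT-4.0", "GPT-4o", "Copilot",
              "ERNIE Bot-3.5", "SPARK", "Qwen-2.5"]

    def score_run(answers: pd.DataFrame, key: pd.Series) -> pd.Series:
        # `answers`: one column of letter choices per model, indexed by
        # question ID; `key`: the correct letter for each question.
        return answers[MODELS].eq(key, axis=0).mean()

    # Toy example with two questions; the study used two full runs per model:
    key = pd.Series(["A", "C"], index=[1, 2])
    run1 = pd.DataFrame({m: ["A", "B"] for m in MODELS}, index=[1, 2])
    print(score_run(run1, key))  # accuracy of 0.5 for every model here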

Results:

Qwen-2.5 achieved the highest accuracy (88.92%) and the lowest variance (0.099). Pairwise comparisons revealed varying degrees of statistical significance, notably between GPT-4.0 and GPT-4o (t=3.27, P=.001) and between GPT-4.0 and Qwen-2.5 (t=2.31, P=.021). Qwen-2.5's answers correlated most strongly with the correct answers (r=0.86), whereas GPT-3.5's correlated most weakly (r=0.40). When the outputs of all seven LLMs were integrated using machine learning and ensemble methods, the random forest (RF) model proved optimal for enhancing accuracy, achieving an area under the curve (AUC) of 0.98, sensitivity of 0.88, specificity of 0.92, F1-score of 0.88, accuracy of 0.88, positive predictive value (PPV) of 0.88, and negative predictive value (NPV) of 1.00.
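
As a rough, hypothetical sketch of such an ensemble (the abstract does not specify the feature construction, labels, or hyperparameters, so all of those are assumptions here), the seven models' answer letters can be one-hot encoded and fed to a scikit-learn random forest:

    # Hypothetical ensemble sketch: the seven LLMs' answer letters are treated
    # as categorical features, and a random forest predicts a binary label
    # (here, dummy data stands in for the graded exam responses).
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import recall_score, roc_auc_score
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import OneHotEncoder

    rng = np.random.default_rng(0)
    X_raw = rng.choice(list("ABCDE"), size=(500, 7))  # dummy answer letters
    y = rng.integers(0, 2, size=500)                  # dummy correctness labels

    X = OneHotEncoder(handle_unknown="ignore").fit_transform(X_raw)
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)

    rf = RandomForestClassifier(n_estimators=500, random_state=42)
    rf.fit(X_tr, y_tr)
    print("AUC:", roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1]))
    print("Sensitivity:", recall_score(y_te, rf.predict(X_te)))

On real data, the reported operating metrics (specificity, PPV, NPV, F1-score) would be derived from the confusion matrix at a chosen decision threshold.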

Conclusions:

Qwen-2.5 and GPT-4o emerged as the leading performers among the LLMs, with Qwen-2.5 excelling on the Chinese National Nursing Licensing Examination. Moreover, combining the LLMs through machine learning markedly enhanced accuracy, suggesting a promising direction for future applications.

Clinical Trial: NA


Citation

Please cite as:

Zhu S, Hu W, Yang Z, Yan J, Zhang F

Qwen-2.5 Outperforms Other Large Language Models in the Chinese National Nursing Licensing Examination: Retrospective Cross-Sectional Comparative Study

JMIR Med Inform 2025;13:e63731

DOI: 10.2196/63731

PMID: 39793017

PMCID: 11759905


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.