Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Jun 27, 2024
Open Peer Review Period: Jun 28, 2024 - Aug 23, 2024
Date Accepted: Dec 20, 2024
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Large Language Models Outperform in the Chinese National Nursing Licensing Examination: A Retrospective Cross-Sectional Study
ABSTRACT
Background:
While large language models (LLMs) often produce impressive outputs, their performance in exams requiring strong reasoning skills and expert domain knowledge, such as the Chinese National Nursing Licensing Examination, remains uncertain.
Objective:
We aimed to assess the performance and educational value of justifications provided by large language models, including GPT-3.5, GPT-4.0, GPT-4o, Copilot, ERNIE Bot-3.5, SPARK, and Qwen-2.5, on the Chinese National Nursing Licensing Examination. Additionally, we explored the feasibility of enhancing their performance by combining these models using machine learning techniques.
Methods:
This retrospective cross-sectional study analyzed multiple-choice questions (MCQs) from the Chinese National Nursing Licensing Examination, each administered twice to GPT-3.5, GPT-4.0, GPT-4o, Copilot, ERNIE Bot-3.5, SPARK, and Qwen-2.5 between May 27 and June 27, 2024. The study also investigated whether machine learning techniques could rapidly improve the aggregate performance of these LLMs.
Results:
Qwen-2.5 achieved the highest accuracy (88.92%) and the lowest variance (0.099). Pairwise comparisons revealed varying degrees of statistical significance, notably between GPT-4.0 and GPT-4o (t = 3.27, P = .001) and between GPT-4.0 and Qwen-2.5 (t = 2.31, P = .021). Qwen-2.5 exhibited the strongest correlation with correct answers (r = 0.86), whereas GPT-3.5 showed the weakest (r = 0.40). Integrating the results of the seven LLMs with machine learning and ensemble methods identified the Random Forest (RF) model as the best performer, achieving an AUC of 0.98, sensitivity of 0.88, specificity of 0.92, F1 score of 0.88, accuracy of 0.88, positive predictive value (PPV) of 0.88, and negative predictive value (NPV) of 1.00.
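The ensemble step described above can be illustrated with a minimal sketch. This is an assumed workflow, not the authors' exact pipeline: each question is encoded by the seven LLMs' binary correctness indicators, and a Random Forest classifier (scikit-learn) learns to predict whether the pooled answer is correct. The data here are synthetic, and the toy label (a simple majority vote) is a stand-in for the true answer key.

```python
# Hedged sketch: ensembling seven LLMs' per-question correctness with a
# Random Forest, on synthetic data (not the study's actual results).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, f1_score

rng = np.random.default_rng(0)
n_questions, n_models = 500, 7  # assumed sizes for illustration

# Synthetic features: 1 = that LLM answered the question correctly.
X = rng.integers(0, 2, size=(n_questions, n_models))
# Toy label: majority vote of the seven models (illustration only).
y = (X.sum(axis=1) >= 4).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X_train, y_train)

proba = rf.predict_proba(X_test)[:, 1]  # probability the answer is correct
pred = rf.predict(X_test)
print(f"AUC: {roc_auc_score(y_test, proba):.2f}")
print(f"F1:  {f1_score(y_test, pred):.2f}")
```

Because the toy label is a deterministic function of the seven indicators, the forest recovers it almost exactly; with real answer keys, the learned weighting of models would drive the reported gains.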
Conclusions:
Qwen-2.5 and GPT-4o emerged as the leading performers among the LLMs, with Qwen-2.5 excelling in the Chinese National Nursing Licensing Examination. Moreover, combining various LLMs through machine learning markedly enhanced accuracy, suggesting a promising direction for future applications. Clinical Trial: NA
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.