Accepted for/Published in: JMIR Medical Education

Date Submitted: Sep 15, 2023
Date Accepted: Jun 20, 2024

The final, peer-reviewed published version of this preprint can be found here:

Influence of Model Evolution and System Roles on ChatGPT’s Performance in Chinese Medical Licensing Exams: Comparative Study

Ming S, Guo Q, Cheng W, Lei B

JMIR Med Educ 2024;10:e52784

DOI: 10.2196/52784

PMID: 39140269

PMCID: 11336778

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Model Evolution and System Roles Influence the Performance of ChatGPT on Chinese Medical Licensing Exams: A Comparative Study

  • Shuai Ming; 
  • Qingge Guo; 
  • Wenjun Cheng; 
  • Bo Lei

ABSTRACT

Background:

With the increasing application of large language models (LLMs) such as ChatGPT across industries, their potential in the medical domain, especially in standardized examinations, has become a focal point of research.

Objective:

To assess the clinical performance of ChatGPT, focusing on its accuracy and reliability in the Chinese National Medical Licensing Examination (CNMLE).

Methods:

The CNMLE 2022 question set, consisting of 500 single-answer multiple-choice questions, was reclassified into 15 medical subspecialties. Each question was tested 8 to 12 times in Chinese on the OpenAI platform from April 24 to May 15, 2023. Three key factors were considered: the model version (GPT-3.5 vs GPT-4.0), the designation of system roles in prompts tailored to medical subspecialties, and repetition to assess coherence. The passing accuracy threshold was set at 60%. χ² tests and kappa values were used to evaluate the models' accuracy and consistency.
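The two statistics named above can be illustrated with a minimal Python sketch. It assumes that kappa is Cohen's kappa computed on the answer options chosen in two runs of the same questions, and that the χ² test is a Pearson test on a 2x2 table of correct and incorrect counts without continuity correction; the abstract does not specify either detail, and the function names and toy inputs are hypothetical.

```python
import math
from collections import Counter


def cohen_kappa(first, second):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(first)
    # Observed agreement: share of items with identical labels in both runs.
    observed = sum(a == b for a, b in zip(first, second)) / n
    # Chance agreement from the marginal label frequencies of each run.
    c1, c2 = Counter(first), Counter(second)
    expected = sum(c1[k] * c2[k] for k in c1) / (n * n)
    if expected == 1.0:
        return 1.0  # both runs used one identical label throughout
    return (observed - expected) / (1 - expected)


def chi2_2x2(correct_a, total_a, correct_b, total_b):
    """Pearson chi-square on a 2x2 table (assumed: no continuity correction).

    Returns (statistic, p_value). Every expected cell count must be nonzero.
    """
    table = [
        [correct_a, total_a - correct_a],
        [correct_b, total_b - correct_b],
    ]
    n = total_a + total_b
    row = [sum(r) for r in table]
    col = [table[0][j] + table[1][j] for j in range(2)]
    stat = 0.0
    for i in range(2):
        for j in range(2):
            e = row[i] * col[j] / n
            stat += (table[i][j] - e) ** 2 / e
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p


# Toy data only, not taken from the study.
kappa = cohen_kappa(["A", "B", "A", "B"], ["A", "B", "B", "B"])
stat, p = chi2_2x2(15, 20, 5, 20)
```

With real data, `first` and `second` would hold the initial and repeated answers to the same questions, and the 2x2 table would hold each model's correct and incorrect counts.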

Results:

GPT-4.0 achieved a passing accuracy of 71.0%-74.7%, significantly higher than that of GPT-3.5 (50.3%-54.8%; P < 0.001). Both models showed relatively high coherence between the initial and second responses, with kappa values of 0.778 and 0.610. System roles boosted accuracy for both GPT-4.0 (0.3%-3.7%) and GPT-3.5 (1.3%-4.5%), and increased kappa by 0.023 and 0.035, respectively. In the multi-specialty analysis, GPT-4.0 passed the threshold in 14 of 15 subspecialties, while GPT-3.5 did so in 7 of 15 on the first response.
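The subspecialty comparison amounts to computing accuracy per subspecialty and checking it against the 60% threshold. The sketch below shows one way to do that in Python; the record layout, the function names, and the choice to treat exactly 60% as a pass are assumptions, and the sample records are invented for illustration.

```python
from collections import defaultdict

PASS_THRESHOLD = 0.60  # passing accuracy stated in the Methods


def accuracy_by_specialty(records):
    """Map each subspecialty to its accuracy.

    records: iterable of (specialty, is_correct) pairs (hypothetical layout).
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for specialty, is_correct in records:
        totals[specialty] += 1
        hits[specialty] += bool(is_correct)
    return {s: hits[s] / totals[s] for s in totals}


def passed_specialties(records, threshold=PASS_THRESHOLD):
    """Sorted subspecialties whose accuracy meets the threshold (assumed >=)."""
    acc = accuracy_by_specialty(records)
    return sorted(s for s, a in acc.items() if a >= threshold)


# Invented records, not study data.
sample = [
    ("cardiology", True),
    ("cardiology", True),
    ("cardiology", False),
    ("neurology", True),
    ("neurology", False),
]
passed = passed_specialties(sample)
```

Counting the entries of `passed` for each model gives the kind of "14 of 15" versus "7 of 15" summary reported above.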

Conclusions:

GPT-4.0 passed the CNMLE and outperformed GPT-3.5 in key areas such as accuracy, consistency, and medical sub-specialty expertise. Adding a system role enhanced the model's reliability and answer coherence. GPT-4.0 showed promising potential in medical education and clinical practice, meriting further study.


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.