JMIR Preprints #52784: Model Evolution and System Roles Influence the Performance of ChatGPT on Chinese Medical Licensing Exams: A Comparative Study

Model Evolution and System Roles Influence the Performance of ChatGPT on Chinese Medical Licensing Exams: A Comparative Study

Shuai Ming;
Qingge Guo;
Wenjun Cheng;
Bo Lei

ABSTRACT

Background:

With the increasing application of large language models (LLMs) such as ChatGPT in various industries, their potential in the medical domain, especially in standardized examinations, has become a focal point of research.

Objective:

To assess the clinical performance of ChatGPT, focusing on its accuracy and reliability in the Chinese National Medical Licensing Examination (CNMLE).

Methods:

The CNMLE 2022 question set, consisting of 500 single-answer multiple-choice questions, was reclassified into 15 medical sub-specialties. Each question was tested 8 to 12 times in Chinese on the OpenAI platform from April 24 to May 15, 2023. Three key factors were considered: the model version (GPT-3.5 or GPT-4.0), the designation in the prompt of system roles tailored to medical sub-specialties, and repetition to assess coherence. The passing accuracy threshold was set at 60%. χ² tests and Kappa values were used to evaluate the model's accuracy and consistency.
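
As a rough illustration of the two statistics named above, the following Python sketch computes Cohen's Kappa between two response runs and a Pearson χ² statistic for a 2×2 table of correct and incorrect counts. It is a minimal sketch under stated assumptions, not the authors' analysis code: the function names, the answer key, and the toy response lists are all hypothetical, and the abstract does not specify which Kappa variant or which χ² correction was used.

```python
from collections import Counter

PASS_THRESHOLD = 0.60  # passing accuracy threshold stated in the Methods


def accuracy(answers, key):
    """Fraction of answers that match the answer key."""
    return sum(a == k for a, k in zip(answers, key)) / len(key)


def cohen_kappa(first, second):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(first) != len(second) or not first:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(first)
    # Observed agreement: share of items given the same label in both runs.
    observed = sum(a == b for a, b in zip(first, second)) / n
    # Chance agreement: product of the marginal label frequencies.
    c1, c2 = Counter(first), Counter(second)
    expected = sum(c1[k] * c2[k] for k in c1.keys() | c2.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both runs used one identical label throughout
    return (observed - expected) / (1 - expected)


def chi2_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        raise ValueError("a row or column total is zero")
    return n * (a * d - b * c) ** 2 / denom


# Hypothetical toy data (NOT from the study): 10 questions, options A-E.
key = ["A", "B", "C", "D", "E", "A", "B", "C", "D", "E"]
run1 = ["A", "B", "C", "D", "A", "A", "B", "C", "E", "E"]
run2 = ["A", "B", "C", "D", "E", "A", "C", "C", "E", "E"]

print("run1 passes:", accuracy(run1, key) >= PASS_THRESHOLD)
print("kappa(run1, run2):", round(cohen_kappa(run1, run2), 3))
# Rows: two hypothetical models; columns: correct, incorrect counts.
print("chi-square:", chi2_2x2(30, 20, 20, 30))
```

Here Kappa is computed on the chosen answer options; computing it on correct/incorrect labels instead would be an equally plausible reading of the Methods.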

Results:

GPT-4.0 achieved a passing accuracy of 71.0%-74.7%, significantly higher than that of GPT-3.5 (50.3%-54.8%; P < 0.001). Both models showed relatively high coherence between the initial and second responses, with Kappa values of 0.778 and 0.610. System roles boosted accuracy for both GPT-4.0 (by 0.3%-3.7%) and GPT-3.5 (by 1.3%-4.5%), and increased the Kappa by 0.023 and 0.035, respectively. In the multi-specialty analysis, GPT-4.0 passed the threshold in 14 of 15 sub-specialties, while GPT-3.5 did so in 7 of 15 on the first response.

Conclusions:

GPT-4.0 passed the CNMLE and outperformed GPT-3.5 in key areas such as accuracy, consistency, and medical sub-specialty expertise. Adding a system role enhanced the model's reliability and answer coherence. GPT-4.0 showed promising potential in medical education and clinical practice, meriting further study.

Citation

Please cite as:

Ming S, Guo Q, Cheng W, Lei B

Influence of Model Evolution and System Roles on ChatGPT’s Performance in Chinese Medical Licensing Exams: Comparative Study

JMIR Med Educ 2024;10:e52784

DOI: 10.2196/52784

PMID: 39140269

PMCID: 11336778

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

Accepted for/Published in: JMIR Medical Education

Date Submitted: Sep 15, 2023

Date Accepted: Jun 20, 2024
