Accepted for/Published in: JMIR Medical Education
Date Submitted: Jul 22, 2023
Open Peer Review Period: Jul 22, 2023 - Sep 16, 2023
Date Accepted: Oct 20, 2023
Pure Wisdom or Potemkin Villages? – A Comparison of ChatGPT 3.5 Versus ChatGPT 4 Based on 1,840 AMBOSS© USMLE® Step 3 Style Questions
ABSTRACT
Background:
The United States Medical Licensing Exams (USMLE®) have been critical in medical education since 1992, testing a medical student’s knowledge and skills through a series of Steps matched to their level of training. Artificial intelligence (AI) tools, including chatbots like ChatGPT, are emerging technologies with potential applications in medicine. However, comprehensive large-scale studies analyzing ChatGPT’s performance on USMLE® Step 3 and comparing different versions of ChatGPT are limited.
Objective:
The aim of this paper was to analyze ChatGPT’s performance on USMLE® Step 3 practice test questions to better elucidate the strengths and weaknesses of AI utilization in medical education and to deduce evidence-based strategies to counteract AI cheating.
Methods:
A total of n = 2,069 USMLE® Step 3 practice questions were extracted from the AMBOSS© study platform. After 229 image-based questions were excluded, the remaining 1,840 text-based questions were categorized and entered into ChatGPT 3.5, while a subset of 229 questions was entered into ChatGPT 4. Responses were recorded, and the accuracy of answers, performance across test question categories, and performance across difficulty levels were compared between the two versions.
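The study entered questions into the ChatGPT interface; purely for illustration, a programmatic equivalent of querying both model versions could look like the following sketch using the OpenAI Python client (v1.x). This is an assumption about tooling, not the authors’ actual setup, and the question text is a placeholder.

```python
# Hypothetical sketch only: the study used the ChatGPT interface, not the API.
from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

question = (
    "A 67-year-old man presents with ...\n"   # placeholder question stem
    "A) ... B) ... C) ... D) ... E) ...\n"    # placeholder answer options
    "Answer with the single best option."
)

for model in ("gpt-3.5-turbo", "gpt-4"):      # the two study arms
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
        temperature=0,  # deterministic output for reproducible grading
    )
    print(model, "->", response.choices[0].message.content)
```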
Results:
Overall, ChatGPT 4 demonstrated statistically significantly superior performance compared to ChatGPT 3.5, achieving an accuracy of 84.7% (194/229) versus 56.9% (1,047/1,840), respectively. A weak but statistically significant negative correlation was observed between the length of test questions and the performance of ChatGPT 3.5 (rs = -0.069; P = 0.003), which was absent in ChatGPT 4 (P = 0.866). Additionally, the difficulty of test questions, as categorized by AMBOSS© hammer ratings, showed a statistically significant negative correlation with performance for both ChatGPT versions, with rs = -0.289 for ChatGPT 3.5 and rs = -0.344 for ChatGPT 4. ChatGPT 4 surpassed ChatGPT 3.5 at all levels of test question difficulty, except for the two highest difficulty tiers (four and five hammers), where statistical significance was not reached.
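For illustration, the headline accuracy comparison and the question-level correlations reported above can be checked with standard nonparametric tests. The snippet below is a minimal sketch using SciPy; the per-question length and correctness vectors are placeholders, not the study’s actual data.

```python
# Minimal sketch of the reported statistics (not the authors' analysis code).
from scipy.stats import chi2_contingency, spearmanr

# Overall accuracy: 2x2 contingency table of correct/incorrect counts.
table = [
    [194, 229 - 194],     # ChatGPT 4: correct, incorrect
    [1047, 1840 - 1047],  # ChatGPT 3.5: correct, incorrect
]
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, P = {p:.2g}")

# Spearman's rank correlation between question length (or hammer rating)
# and per-question correctness; placeholder vectors shown here.
lengths = [820, 450, 1210, 640, 990]  # hypothetical character counts
correct = [1, 1, 0, 1, 0]             # hypothetical outcomes (1 = correct)
rs, p_val = spearmanr(lengths, correct)
print(f"r_s = {rs:.3f}, P = {p_val:.3f}")
```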
Conclusions:
In this study, ChatGPT 4 demonstrated remarkable proficiency on the USMLE® Step 3, with an accuracy rate of 84.7% (194/229), outshining ChatGPT 3.5 with an accuracy rate of 56.9% (1,047/1,840). While ChatGPT 4 performed exceptionally well overall, it encountered difficulties with questions requiring the application of theoretical concepts, particularly in cardiology and neurology. These insights are pivotal for the development of examination strategies that are resilient to AI, and underline the promising role of AI in the realm of medical education and diagnostics.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.