Accepted for/Published in: JMIR Medical Education
Date Submitted: Jul 3, 2023
Open Peer Review Period: Jul 3, 2023 - Jul 18, 2023
Date Accepted: Sep 5, 2023
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Assessment of Resident and Artificial Intelligence Chatbot Performance on the University of Toronto Family Medicine Residency Progress Test: A Comparative Study
ABSTRACT
Background:
Large language model (LLM)-based chatbots are developing at an unprecedented pace with the release of Chat Generative Pretrained Transformer (ChatGPT) and its successor, GPT-4. Their capabilities in general-purpose tasks and language generation have evolved to the point of excellent performance on various educational examination benchmarks, including medical-knowledge tests. Comparing the performance of these two LLMs with that of Family Medicine residents on a multiple-choice medical-knowledge test can provide insight into their potential utility as a medical education tool.
Objective:
To quantitatively and qualitatively compare the performance of ChatGPT, GPT-4, and Family Medicine residents on a multiple-choice medical-knowledge test appropriate for the level of a Family Medicine resident.
Methods:
An official University of Toronto Department of Family and Community Medicine (DFCM) Progress Test consisting of multiple-choice questions was inputted into ChatGPT and GPT-4. The chatbots' responses were manually reviewed to determine the selected answer, response length, response time, whether a rationale was provided for the output, and the root cause of each incorrect response (classified as a statistical, logical, or information error). The performance of the artificial intelligence chatbots was compared against a cohort of family medicine residents who concurrently attempted the test.
Results:
The average score of family medicine residents was 56.9% (95% confidence interval: 56.2%-57.6%). GPT-4 performed significantly better than ChatGPT (difference = 24.1%, 95% CI 19.8%-28.5%; McNemar's χ2 = 15.19, P = .0002), correctly answering 89/108 (82.4%) questions, while ChatGPT answered 62/108 (57.4%) correctly. GPT-4 also scored higher across all 11 categories of family medicine knowledge. GPT-4 provided a rationale for why the other multiple-choice options were not chosen in 86.1% of its responses, compared with 16.7% for ChatGPT. Qualitatively, for both ChatGPT and GPT-4, logical errors were the most common and statistical errors the least common. The performance of ChatGPT was similar to that of the average family medicine resident, whereas the performance of GPT-4 exceeded that of the top-performing family medicine resident.
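Because the two chatbots answered the same 108 questions, the comparison above uses McNemar's test, which considers only the discordant pairs (questions one model got right and the other got wrong). The sketch below shows how the statistic and an exact two-sided p-value are computed; the discordant counts are hypothetical, chosen only so their margin (27 questions) matches the reported score difference of 89 − 62, and they are not taken from the study.

```python
from math import comb

def mcnemar(b: int, c: int) -> tuple[float, float]:
    """McNemar's test from discordant pair counts.

    b: questions answered correctly by model A but not model B
    c: questions answered correctly by model B but not model A
    Returns (chi-square statistic without continuity correction,
             exact two-sided binomial p-value).
    """
    n = b + c
    chi2 = (b - c) ** 2 / n
    # Exact test: under H0 each discordant pair is a fair coin flip,
    # so the smaller count follows Binomial(n, 0.5).
    p = min(1.0, 2 * sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n)
    return chi2, p

# Hypothetical discordant counts (not reported in the abstract):
# suppose GPT-4 got 32 questions right that ChatGPT missed, and
# ChatGPT got 5 right that GPT-4 missed (32 - 5 = 27 = 89 - 62).
chi2, p = mcnemar(32, 5)
```

With a small number of discordant pairs, the exact binomial form is preferable to the chi-square approximation, which is why both are returned here.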
Conclusions:
GPT-4 significantly outperforms both ChatGPT and Family Medicine residents on a multiple-choice medical-knowledge test designed for Family Medicine residents. GPT-4 provides a logical rationale for its answer choices, ruling out the other options efficiently and with concise justification. Its high degree of accuracy and advanced reasoning capabilities suggest potential applications as a medical education tool, such as creating exam questions and scenarios, or serving as a resource for medical knowledge and for learning about community services.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.