Accepted for/Published in: JMIR Medical Education
Date Submitted: Jul 3, 2023
Open Peer Review Period: Jul 3, 2023 - Jul 18, 2023
Date Accepted: Sep 5, 2023
Assessment of Resident and Artificial Intelligence Chatbot Performance on the University of Toronto Family Medicine Residency Progress Test: A Comparative Study
ABSTRACT
Background:
Large Language Model (LLM)-based chatbots are developing at an unprecedented pace following the release of Chat Generative Pretrained Transformer (ChatGPT) and its successor, GPT-4. Their capabilities on general-purpose tasks and language generation have advanced to the point of excellent performance on various educational examination benchmarks, including medical-knowledge tests. Comparing the performance of these two LLMs with that of Family Medicine residents on a multiple-choice medical-knowledge test can provide insight into their potential utility as a medical education tool.
Objective:
To quantitatively and qualitatively compare the performance of ChatGPT, GPT-4, and Family Medicine residents on a multiple-choice medical-knowledge test appropriate for the level of a Family Medicine resident.
Methods:
An official University of Toronto Department of Family and Community Medicine (DFCM) Progress Test consisting of multiple-choice questions was entered into ChatGPT and GPT-4. The chatbots' responses were manually reviewed to determine the selected answer, response length, response time, whether a rationale was provided for the response, and the root cause of each incorrect response (classified as a statistical, logical, or information error). The performance of the artificial intelligence chatbots was compared against that of a cohort of Family Medicine residents who concurrently attempted the test.
Results:
GPT-4 performed significantly better than ChatGPT (difference = 25.0%, 95% CI: 16.3%, 32.8%; McNemar's test: p<0.0001), correctly answering 89/108 (82.4%) questions, whereas ChatGPT answered 62/108 (57.4%) correctly. GPT-4 also scored higher across all 11 categories of Family Medicine knowledge. GPT-4 provided a rationale for why the other multiple-choice options were not chosen in 86.1% of its responses, compared with 16.7% for ChatGPT. Qualitatively, for both ChatGPT and GPT-4, logical errors were the most common and statistical errors the least common. The average performance of Family Medicine residents was 56.9% (95% CI: 56.2%, 57.6%). ChatGPT's performance was similar to that of the average Family Medicine resident, whereas GPT-4's performance exceeded that of the top-performing resident.
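For illustration, the paired comparison above can be reproduced with McNemar's test on per-question correctness. The following is a minimal Python sketch, assuming hypothetical 0/1 correctness vectors for each model (randomly generated here; the study's actual graded responses would be used in practice). Variable names are illustrative only.

import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical 0/1 correctness for each of the 108 questions;
# substitute the actual graded responses from the Progress Test.
rng = np.random.default_rng(0)
chatgpt_correct = rng.integers(0, 2, size=108)
gpt4_correct = rng.integers(0, 2, size=108)

# 2x2 table of paired outcomes: rows index ChatGPT
# (0 = incorrect, 1 = correct), columns index GPT-4 likewise.
table = np.zeros((2, 2), dtype=int)
for a, b in zip(chatgpt_correct, gpt4_correct):
    table[a, b] += 1

# McNemar's test depends only on the discordant (off-diagonal) cells,
# i.e., questions where exactly one model answered correctly.
result = mcnemar(table, exact=True)
print(f"difference = {gpt4_correct.mean() - chatgpt_correct.mean():.1%}")
print(f"McNemar exact p-value = {result.pvalue:.4f}")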
Conclusions:
GPT-4 significantly outperforms both ChatGPT and Family Medicine residents on a multiple-choice medical-knowledge test designed for Family Medicine residents. GPT-4 provides a logical rationale for its answer choice, efficiently ruling out the other options with concise justification. Its high accuracy and advanced reasoning capabilities support its potential applications as a medical education tool, such as creating exam questions and scenarios or serving as a resource for medical knowledge and for learning about community services.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.