Accepted for/Published in: JMIR Medical Education
Date Submitted: Jul 3, 2023
Open Peer Review Period: Jul 3, 2023 - Jul 18, 2023
Date Accepted: Sep 5, 2023
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Assessment of Resident and Artificial Intelligence Chatbot Performance on the University of Toronto Family Medicine Residency Progress Test: A Comparative Study
ABSTRACT
Background:
Large language model (LLM)-based chatbots are developing at an unprecedented pace with the release of Chat Generative Pretrained Transformer (ChatGPT) and its successor, GPT-4. Their capabilities in general-purpose tasks and language generation have evolved to the point of excellent performance on various educational examination benchmarks, including medical-knowledge tests. Comparing the performance of these two LLMs with that of Family Medicine residents on a multiple-choice medical-knowledge test can provide insight into their potential utility as a medical education tool.
Objective:
To quantitatively and qualitatively compare the performance of ChatGPT, GPT-4, and Family Medicine residents on a multiple-choice medical-knowledge test appropriate for the level of a Family Medicine resident.
Methods:
An official University of Toronto Department of Family and Community Medicine (DFCM) Progress Test consisting of multiple-choice questions was inputted into ChatGPT and GPT-4. The chatbots' responses were manually reviewed to determine the selected answer, response length, response time, whether a rationale was provided for the output, and the root cause of each incorrect response (classified as a statistical, logical, or information error). The performance of the artificial intelligence chatbots was compared against a cohort of family medicine residents who concurrently attempted the test.
Results:
The average score of family medicine residents was 56.9% (95% confidence interval: 56.2%-57.6%). GPT-4 performed significantly better than ChatGPT (difference = 24.1%, 95% CI 19.8%-28.5%; McNemar's χ2 = 15.19, P = .0002), correctly answering 89/108 (82.4%) questions, while ChatGPT answered 62/108 (57.4%) correctly. GPT-4 also scored higher across all 11 categories of family medicine knowledge. GPT-4 provided a rationale for why the other multiple-choice options were not chosen in 86.1% of its responses, compared with 16.7% for ChatGPT. Qualitatively, for both ChatGPT and GPT-4, logical errors were the most common and statistical errors the least common. The performance of ChatGPT was similar to that of the average family medicine resident, whereas the performance of GPT-4 exceeded that of the top-performing family medicine resident.
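Because the two chatbots answered the same 108 questions, the comparison above uses McNemar's test, which considers only the discordant pairs (questions one model got right and the other got wrong). The sketch below shows how the statistic and an exact two-sided p-value are computed; the discordant counts are hypothetical, chosen only so their margin (27 questions) matches the reported score difference of 89 − 62, and they are not taken from the study.

```python
from math import comb

def mcnemar(b: int, c: int) -> tuple[float, float]:
    """McNemar's test from discordant pair counts.

    b: questions answered correctly by model A but not model B
    c: questions answered correctly by model B but not model A
    Returns (chi-square statistic without continuity correction,
             exact two-sided binomial p-value).
    """
    n = b + c
    chi2 = (b - c) ** 2 / n
    # Exact test: under H0 each discordant pair is a fair coin flip,
    # so the smaller count follows Binomial(n, 0.5).
    p = min(1.0, 2 * sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n)
    return chi2, p

# Hypothetical discordant counts (not reported in the abstract):
# suppose GPT-4 got 32 questions right that ChatGPT missed, and
# ChatGPT got 5 right that GPT-4 missed (32 - 5 = 27 = 89 - 62).
chi2, p = mcnemar(32, 5)
```

With a small number of discordant pairs, the exact binomial form is preferable to the chi-square approximation, which is why both are returned here.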
Conclusions:
GPT-4 significantly outperforms both ChatGPT and Family Medicine residents on a multiple-choice medical-knowledge test designed for Family Medicine residents. GPT-4 provides a logical rationale for its answer choices, ruling out the other options efficiently and with concise justification. Its high degree of accuracy and advanced reasoning capabilities suggest potential applications as a medical education tool, such as creating exam questions and scenarios, or serving as a resource for medical knowledge and for learning about community services.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.