
Accepted for/Published in: JMIR Medical Education

Date Submitted: Jul 3, 2023
Open Peer Review Period: Jul 3, 2023 - Jul 18, 2023
Date Accepted: Sep 5, 2023

The final, peer-reviewed published version of this preprint can be found here:

Assessment of Resident and AI Chatbot Performance on the University of Toronto Family Medicine Residency Progress Test: Comparative Study

Huang RS, Lu KJQ, Meaney C, Kemppainen J, Punnett A, Leung FH

JMIR Med Educ 2023;9:e50514

DOI: 10.2196/50514

PMID: 37725411

PMCID: 10548315

Assessment of Resident and Artificial Intelligence Chatbot Performance on the University of Toronto Family Medicine Residency Progress Test: A Comparative Study

  • Ryan S. Huang; 
  • Kevin Jia Qi Lu; 
  • Christopher Meaney; 
  • Joel Kemppainen; 
  • Angela Punnett; 
  • Fok-Han Leung

ABSTRACT

Background:

Large language model (LLM)-based chatbots are developing at an unprecedented pace, as demonstrated by the release of Chat Generative Pretrained Transformer (ChatGPT) and its successor, GPT-4. Their capabilities in general-purpose tasks and language generation have evolved to the point of excellent performance on various educational examination benchmarks, including medical-knowledge tests. Comparing the performance of these two LLMs to that of Family Medicine residents on a multiple-choice medical-knowledge test can provide insight into their potential utility as medical education tools.

Objective:

To quantitatively and qualitatively compare the performance of ChatGPT, GPT-4, and Family Medicine residents on a multiple-choice medical-knowledge test appropriate for the level of a Family Medicine resident.

Methods:

An official University of Toronto Department of Family and Community Medicine (DFCM) Progress Test consisting of multiple-choice questions was entered into ChatGPT and GPT-4. The chatbots' responses were manually reviewed to determine the selected answer, response length, response time, whether a rationale was provided for the response, and the root cause of each incorrect response (classified into statistical, logical, and information errors). The performance of the artificial intelligence chatbots was compared against that of a cohort of family medicine residents who concurrently attempted the test.

Results:

GPT-4 performed significantly better than ChatGPT (difference 25.0%, 95% CI 16.3%-32.8%; McNemar's test: P<.0001), correctly answering 89/108 (82.4%) questions, whereas ChatGPT answered 62/108 (57.4%) correctly. Furthermore, GPT-4 scored higher across all 11 categories of family medicine knowledge. GPT-4 provided a rationale for why the other multiple-choice options were not chosen in 86.1% of its responses, compared with 16.7% for ChatGPT. Qualitatively, for both ChatGPT and GPT-4, logical errors were the most common and statistical errors the least common. The average performance of family medicine residents was 56.9% (95% CI 56.2%-57.6%). The performance of ChatGPT was similar to that of the average family medicine resident, while the performance of GPT-4 exceeded that of the top-performing family medicine resident.
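The paired comparison above can be sketched in code. The accuracy difference (89/108 vs 62/108) follows directly from the reported counts, but McNemar's test requires the split of discordant pairs (questions where exactly one model was correct), which the abstract does not report; the counts `b` and `c` below are hypothetical values chosen only to be consistent with the reported marginal totals.

```python
import math

# Marginal totals from the abstract: GPT-4 answered 89/108 correctly,
# ChatGPT 62/108. The discordant-pair split below is HYPOTHETICAL:
# any b, c with b - c = 89 - 62 = 27 would match the margins.
b = 29  # questions GPT-4 got right and ChatGPT got wrong (assumed)
c = 2   # questions GPT-4 got wrong and ChatGPT got right (assumed)

# McNemar's chi-square statistic (no continuity correction),
# which depends only on the discordant pairs.
stat = (b - c) ** 2 / (b + c)

# Upper-tail p-value for chi-square with 1 df: P(X >= stat) = erfc(sqrt(stat/2))
p = math.erfc(math.sqrt(stat / 2))

# Reported accuracy difference: 89/108 - 62/108 = 25.0%
diff = (89 - 62) / 108

print(f"difference = {diff:.1%}, chi2 = {stat:.2f}, p = {p:.2e}")
```

With these assumed counts the statistic is well above the P<.0001 threshold reported in the abstract; the exact p-value would depend on the true discordant-pair counts.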

Conclusions:

GPT-4 significantly outperforms both ChatGPT and Family Medicine residents on a multiple-choice medical-knowledge test designed for Family Medicine residents. GPT-4 provides a logical rationale for its answer choice, ruling out the other options efficiently and with concise justification. Its high accuracy and advanced reasoning capabilities support its potential applications as a medical education tool, such as creating examination questions and scenarios or serving as a resource for medical knowledge and information about community services.




© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.