Accepted for/Published in: JMIR AI

Date Submitted: Oct 18, 2024
Date Accepted: Feb 12, 2025

The final, peer-reviewed published version of this preprint can be found here:

Evaluation of ChatGPT Performance on Emergency Medicine Board Examination Questions: Observational Study

Kajitani S, Pastrak M, Goodings A, Drewek A, Lafree A, Murphy A

Evaluation of ChatGPT Performance on Emergency Medicine Board Examination Questions: Observational Study

JMIR AI 2025;4:e67696

DOI: 10.2196/67696

PMID: 40611478

PMCID: PMC12231519

Warning: This is an author submission that is not peer-reviewed or edited. Preprints, unless they show as "accepted", should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Generative AI’s Performance on Emergency Medicine Boards Questions: Observational Study

Sten Kajitani; Mila Pastrak; Anthony Goodings; Austin Drewek; Andrew Lafree; Adrian Murphy

ABSTRACT

Background:

The ever-evolving field of medicine has highlighted the potential of ChatGPT as an assistive platform. However, opinion on its use in medical board examination preparation and completion remains divided.

Objective:

This study aimed to evaluate the performance of a custom-modified version of ChatGPT-4, tailored with emergency medicine board examination preparatory materials (an Anki deck), against the default version and the previous iteration (ChatGPT-3.5). The goal was to assess the accuracy of ChatGPT-4 in answering board-style questions and its suitability as a tool for medical education.

Methods:

A comparative analysis was conducted using a random selection of 598 questions from the Rosh In-Training Exam Question Bank. The subjects of the study were three versions of ChatGPT: the default ChatGPT-4, a custom-modified ChatGPT-4, and ChatGPT-3.5. Accuracy, response length, medical discipline subgroups, and underlying causes of error were analyzed.
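The abstract does not describe the tooling used to administer the questions to each version. Purely as an illustrative sketch, the snippet below shows how such a three-way comparison might be automated with the OpenAI Python SDK; the model identifiers, prompts, and letter-matching scoring rule are assumptions for illustration, not the authors' actual procedure.

```python
# Hypothetical sketch of the comparative setup described above; model
# names, prompts, and scoring are illustrative assumptions, not the
# authors' actual pipeline.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PREP_MATERIALS = "..."  # placeholder for exam preparatory materials (custom condition)

CONDITIONS = {
    "gpt35": ("gpt-3.5-turbo", "Answer with the letter of the best option."),
    "default_gpt4": ("gpt-4", "Answer with the letter of the best option."),
    "custom_gpt4": ("gpt-4", "Use these prep materials:\n" + PREP_MATERIALS),
}

def ask(model: str, system_prompt: str, question: str) -> str:
    """Send one board-style multiple-choice question; return the model's reply."""
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content

def accuracy(replies: list[str], keys: list[str]) -> float:
    """Fraction of replies whose first letter matches the answer key (A-D)."""
    hits = sum(r.strip().upper().startswith(k) for r, k in zip(replies, keys))
    return hits / len(keys)
```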

Results:

The custom version did not demonstrate a significant improvement in accuracy over the default version (p>0.05), although both significantly outperformed ChatGPT-3.5 (p<0.05). The default version produced significantly longer responses than the custom version (p<0.05). Subgroup analysis revealed no significant difference in performance across medical subdisciplines between the versions (p>0.05). Both ChatGPT-4 versions exhibited similar underlying causes of error (p>0.05) and had a 99% predicted probability of passing, whereas ChatGPT-3.5 had an 85% predicted probability.
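The abstract reports significance thresholds but not the specific statistical tests. Assuming accuracy was compared as proportions of correct answers out of the 598-question sample, a chi-square test on a 2×2 contingency table, sketched below with made-up counts, is one standard way to make such a comparison.

```python
# Illustrative two-proportion comparison between model versions.
# The counts passed in at the bottom are placeholders, not study data.
from scipy.stats import chi2_contingency

N_QUESTIONS = 598  # size of the question sample described in the Methods

def compare_accuracy(correct_a: int, correct_b: int, n: int = N_QUESTIONS):
    """Chi-square test on a 2x2 table of correct/incorrect counts."""
    table = [
        [correct_a, n - correct_a],
        [correct_b, n - correct_b],
    ]
    chi2, p, _, _ = chi2_contingency(table)
    return chi2, p

# e.g., custom vs. default ChatGPT-4 (placeholder counts):
chi2, p = compare_accuracy(correct_a=540, correct_b=535)
print(f"chi2 = {chi2:.3f}, p = {p:.3f}")  # p > 0.05 -> no significant difference
```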

Conclusions:

The findings suggest that although newer versions of ChatGPT perform better on emergency medicine board examination questions, tailoring ChatGPT-4 with exam preparatory materials did not significantly improve its accuracy. The study highlights the potential of ChatGPT-4 as a tool for medical education, capable of providing accurate support across a wide range of emergency medicine topics.




© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.