Accepted for/Published in: JMIR AI

Date Submitted: Oct 18, 2024
Date Accepted: Feb 12, 2025

The final, peer-reviewed published version of this preprint can be found here:

Evaluation of ChatGPT Performance on Emergency Medicine Board Examination Questions: Observational Study

Pastrak M, Kajitani S, Goodings AJ, Drewek A, Lafree A, Murphy A

JMIR AI 2025;4:e67696

DOI: 10.2196/67696

PMID: 40611478

PMCID: 12231519

Evaluation of ChatGPT Performance on Emergency Medicine Board Exam Questions: Observational Study

  • Mila Pastrak; 
  • Sten Kajitani; 
  • Anthony James Goodings; 
  • Austin Drewek; 
  • Andrew Lafree; 
  • Adrian Murphy

ABSTRACT

Background:

The ever-evolving field of medicine has highlighted the potential of ChatGPT as an assistive platform. However, opinion on its use in medical board examination preparation and completion remains divided.

Objective:

This study aimed to evaluate the performance of a custom-modified version of ChatGPT-4, tailored with emergency medicine board examination preparatory materials (an Anki flashcard deck), against the default ChatGPT-4 and the previous iteration, ChatGPT-3.5. The goal was to assess the accuracy of ChatGPT-4 in answering board-style questions and its suitability as a tool to aid students and trainees in preparing for standardized examinations.

Methods:

A comparative analysis was conducted on a random selection of 598 questions from the Rosh In-Training Exam Question Bank. The subjects of the study were three versions of ChatGPT: the default ChatGPT-4, the custom-modified ChatGPT-4, and ChatGPT-3.5. Accuracy, response length, performance across medical discipline subgroups, and underlying causes of error were analyzed.

Results:

The Custom version did not demonstrate a significant improvement in accuracy over the Default version (P=.61), though both significantly outperformed ChatGPT-3.5 (P<.001). The Default version produced significantly longer responses than the Custom version (1371, SD 444, vs 929, SD 408, respectively; P<.001). Subgroup analysis revealed no significant difference in performance across medical subdisciplines between the versions (P>.05 in all cases). Both ChatGPT-4 versions exhibited similar underlying error types (P>.05 in all cases) and had a 99% predicted probability of passing, whereas ChatGPT-3.5 had an 85% probability.
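
The abstract does not state which statistical test produced the accuracy P values. As an illustration only, a two-sided two-proportion z-test, one standard way to compare the correct-answer rates of two models on the same 598-question set, can be sketched as follows; the counts used here are hypothetical and are not the study's data.

```python
from math import sqrt, erf

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for a difference between two proportions.

    x1/n1 and x2/n2 are correct-answer counts over total questions
    for two models answering the same question bank.
    """
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)  # pooled proportion under H0: p1 == p2
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided P value from the standard normal CDF, via math.erf
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical counts out of 598 questions (not the study's results)
z, p = two_proportion_z_test(500, 598, 495, 598)
```

With near-identical counts the test returns a large P value (no significant difference), mirroring the pattern reported between the Custom and Default versions; a larger gap in correct counts drives the P value below .001.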

Conclusions:

The findings suggest that while newer versions of ChatGPT exhibit improved performance in emergency medicine board exam preparation, specific enhancement with a comprehensive Anki flashcard deck on the topic does not significantly impact accuracy. The study highlights the potential of ChatGPT-4 as a tool for medical education, capable of providing accurate support across a wide range of topics in emergency medicine in its default form.




© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.