Currently submitted to: JMIR Formative Research
Date Submitted: Mar 26, 2026
Open Peer Review Period: Mar 30, 2026 - May 25, 2026
(currently open for review)
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Performance of Large Language Models on Board-Style Obstetrics and Gynecology Questions: A Cross-Sectional Study
ABSTRACT
Background:
Large language models (LLMs) such as ChatGPT, Claude, and Llama have demonstrated strong performance on general medical knowledge assessments, but their accuracy in specialty-specific domains such as Obstetrics and Gynecology (OBGYN) is less well characterized. Prior studies suggest high overall performance, but topic-specific proficiency across OBGYN subspecialties has not yet been evaluated, underscoring the need to assess these models before they are integrated into resident education and clinical use.
Objective:
To benchmark the accuracy of contemporary LLMs on OBGYN knowledge using board-style question stems across subspecialty domains.
Methods:
We selected 50 questions from each of six Personal Review of Learning in Obstetrics and Gynecology (PROLOG) volumes, covering core OBGYN topics (300 questions in total). Three LLMs (ChatGPT-4, Claude 3.5, and Llama 3.1) were prompted to answer the full set of 300 questions in topic-based blocks of 50. This was repeated over six independent sessions, totaling 1,800 question entries per model, to obtain an average performance measure and minimize memory bias. Model responses to each question were graded against the answer key provided in the PROLOG volumes using binary scoring at the individual question level: a response was ‘correct’ only if it matched the single best answer as defined by the PROLOG volume, and ‘incorrect’ otherwise. Average performance across sessions was compared against the 2024 national Council on Resident Education in Obstetrics and Gynecology (CREOG) resident examination average as a contextual benchmark. Kruskal-Wallis tests, pairwise comparisons, and effect sizes (Cohen’s d) were used to assess differences in performance across models and topics.
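For readers who wish to see the analysis pattern concretely, the sketch below illustrates the binary scoring rule and the statistical comparisons named above (Kruskal-Wallis, pairwise comparisons, and Cohen's d). It is not the authors' code: the per-session accuracy values are invented for illustration, and the use of Mann-Whitney U tests for the pairwise comparisons is an assumption, since the abstract does not name the specific pairwise test.

# Illustrative sketch only (assumed data and pairwise test), not the study's actual analysis code.
from itertools import combinations

import numpy as np
from scipy.stats import kruskal, mannwhitneyu

def score_session(responses, answer_key):
    """Binary scoring: 1 if a response matches the single best answer, else 0."""
    return np.array([int(r == k) for r, k in zip(responses, answer_key)])

def cohens_d(a, b):
    """Cohen's d using a pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_sd = np.sqrt(((na - 1) * np.var(a, ddof=1) + (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2))
    return (np.mean(a) - np.mean(b)) / pooled_sd

# Hypothetical per-session accuracies (%) over six sessions for each model.
accuracies = {
    "Claude 3.5": [77, 75, 76, 78, 74, 76],
    "ChatGPT-4":  [71, 69, 70, 72, 68, 70],
    "Llama 3.1":  [68, 66, 67, 69, 65, 67],
}

# Omnibus comparison across the three models, then pairwise comparisons with effect sizes.
h_stat, p_value = kruskal(*accuracies.values())
print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {p_value:.4f}")
for (name_a, a), (name_b, b) in combinations(accuracies.items(), 2):
    _, p_pair = mannwhitneyu(a, b, alternative="two-sided")  # assumed pairwise test
    print(f"{name_a} vs {name_b}: p = {p_pair:.4f}, d = {cohens_d(a, b):.2f}")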
Results:
Overall accuracies were 76% (Claude 3.5), 70% (ChatGPT-4), and 67% (Llama 3.1). Claude 3.5 outperformed the other models overall and in most topic areas, with the largest differences observed in Obstetrics and Reproductive Endocrinology. Accuracy was highest in Patient Management in the Office (84–86% across models) and lowest in Urogynecology and Pelvic Reconstructive Surgery (59–69%). Because PROLOG and CREOG questions are not identical, the reported national CREOG average serves only as an indirect contextual benchmark. Within this context, average LLM performance on PROLOG questions (67–76%) exceeded the reported national CREOG average across all resident levels (66%), but ChatGPT-4 (70%) and Llama 3.1 (67%) did not reach the average performance of a PGY-4 resident (71%).
Conclusions:
LLM accuracy overlapped with reported national CREOG averages. Claude 3.5 outperformed ChatGPT-4 and Llama 3.1 and exceeded PGY-4 accuracy. While promising as educational adjuncts, LLMs currently operate at a trainee level and should complement, not replace, traditional clinical training.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.