Currently submitted to: JMIR Formative Research
Date Submitted: Mar 26, 2026
Open Peer Review Period: Mar 30, 2026 - May 25, 2026
(currently open for review)
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Performance of Large Language Models on Board-Style Obstetrics and Gynecology Questions: A Cross-Sectional Study
ABSTRACT
Background:
Large language models (LLMs) such as ChatGPT, Claude, and Llama have demonstrated strong performance on general medical knowledge assessments, but their accuracy in specialty-specific domains such as Obstetrics and Gynecology (OBGYN) is less well characterized. Prior studies suggest high overall performance, but topic-specific proficiency across OBGYN subspecialties has not yet been evaluated, underscoring the need to assess these models before they are integrated into resident education and clinical use.
Objective:
To benchmark the accuracy of contemporary LLMs on OBGYN knowledge using board-style question stems across subspecialty domains.
Methods:
We selected 50 questions from each of six Personal Review of Learning in Obstetrics and Gynecology (PROLOG) volumes, covering core OBGYN topics (300 questions in total). Three LLMs (ChatGPT-4, Claude 3.5, and Llama 3.1) were prompted to answer the full set of 300 questions in topic-based blocks of 50. This was repeated over six independent sessions, totaling 1,800 question entries per model, to obtain an average performance measure and minimize memory bias. Model responses to each question were graded against the answer key provided in the PROLOG volumes using binary scoring at the individual question level: a response was ‘correct’ only if it matched the single best answer as defined by the PROLOG volume, and ‘incorrect’ otherwise. Average performance across sessions was compared against the 2024 national Council on Resident Education in Obstetrics and Gynecology (CREOG) resident examination average as a contextual benchmark. Kruskal-Wallis tests, pairwise comparisons, and effect sizes (Cohen’s d) were used to assess differences in performance across models and topics.
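For readers who wish to see the analysis pattern concretely, the sketch below illustrates the binary scoring rule and the statistical comparisons named above (Kruskal-Wallis, pairwise comparisons, and Cohen's d). It is not the authors' code: the per-session accuracy values are invented for illustration, and the use of Mann-Whitney U tests for the pairwise comparisons is an assumption, since the abstract does not name the specific pairwise test.

# Illustrative sketch only (assumed data and pairwise test), not the study's actual analysis code.
from itertools import combinations

import numpy as np
from scipy.stats import kruskal, mannwhitneyu

def score_session(responses, answer_key):
    """Binary scoring: 1 if a response matches the single best answer, else 0."""
    return np.array([int(r == k) for r, k in zip(responses, answer_key)])

def cohens_d(a, b):
    """Cohen's d using a pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_sd = np.sqrt(((na - 1) * np.var(a, ddof=1) + (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2))
    return (np.mean(a) - np.mean(b)) / pooled_sd

# Hypothetical per-session accuracies (%) over six sessions for each model.
accuracies = {
    "Claude 3.5": [77, 75, 76, 78, 74, 76],
    "ChatGPT-4":  [71, 69, 70, 72, 68, 70],
    "Llama 3.1":  [68, 66, 67, 69, 65, 67],
}

# Omnibus comparison across the three models, then pairwise comparisons with effect sizes.
h_stat, p_value = kruskal(*accuracies.values())
print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {p_value:.4f}")
for (name_a, a), (name_b, b) in combinations(accuracies.items(), 2):
    _, p_pair = mannwhitneyu(a, b, alternative="two-sided")  # assumed pairwise test
    print(f"{name_a} vs {name_b}: p = {p_pair:.4f}, d = {cohens_d(a, b):.2f}")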
Results:
Overall accuracies were 76% (Claude 3.5), 70% (ChatGPT-4), and 67% (Llama 3.1). Claude 3.5 outperformed the other models overall and in most topic areas, with the largest differences observed in Obstetrics and Reproductive Endocrinology. Accuracy was highest in Patient Management in the Office (84–86% across models) and lowest in Urogynecology and Pelvic Reconstructive Surgery (59–69%). Because PROLOG and CREOG questions are not identical, the reported national CREOG average serves only as an indirect contextual benchmark. Within this context, average LLM performance on PROLOG questions (67–76%) exceeded the reported national CREOG average across all resident levels (66%), but ChatGPT-4 (70%) and Llama 3.1 (67%) did not reach the average performance of a PGY-4 resident (71%).
Conclusions:
LLM accuracy overlapped with reported national CREOG averages. Claude 3.5 outperformed ChatGPT-4 and Llama 3.1 and exceeded PGY-4 accuracy. While promising as educational adjuncts, LLMs currently operate at a trainee level and should complement, not replace, traditional clinical training.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.