
Currently submitted to: JMIR Formative Research

Date Submitted: Feb 4, 2026
Open Peer Review Period: Feb 8, 2026 - Apr 5, 2026
(currently open for review)

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

A Comparative Benchmark of 19 Large Language Models for Structured Data Extraction from Neurosurgical Clinical Records

  • Martin Černý; 
  • Martin Májovský; 
  • Sanju Lama; 
  • Guo Edward; 
  • Hana Hallak; 
  • Rahul Singh; 
  • Homer Riva-Cambrin; 
  • Petr Šustek; 
  • Lucie Široká; 
  • Ivana Tullett; 
  • Katarina Horčičáková; 
  • Samuel Hužava; 
  • Kateřina Sajfrídová; 
  • Klára El-Haj Ahmadová; 
  • Garnette Roy Sutherland; 
  • David Netuka

ABSTRACT

Background:

Large language models (LLMs) are increasingly used to extract information from electronic health records (EHRs). Given the rapid pace of LLM development, robust scenario-specific benchmarks are essential to evaluate clinical usefulness and support safe deployment.

Objective:

To compare contemporary LLMs on structured data extraction from real neurosurgical EHRs written in the Czech language.

Methods:

In a prospective single-center cohort, 172 hospitalized patients provided informed consent for the use of anonymized EHRs. For each patient, predefined records were collected and concatenated. Ground truth for 35 data points was established by dual extraction with consensus. A standardized prompt requesting JSON output was submitted to 19 LLMs. The primary outcome was overall accuracy; secondary outcomes were category-level accuracy and the proportion of complete machine-readable outputs.
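The two output checks described above (complete machine-readable JSON; per-field accuracy against the dual-extracted ground truth) can be sketched as follows. This is a minimal illustration, not the study's code: the field names are hypothetical, since the abstract does not enumerate the 35 data points.

```python
import json

# Hypothetical field names for illustration; the study extracted 35
# predefined data points that are not listed in the abstract.
EXPECTED_FIELDS = ["age", "sex", "diagnosis"]

def is_complete_output(raw: str, expected_fields=EXPECTED_FIELDS) -> bool:
    """True if the model reply parses as JSON and covers every expected field."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and all(f in data for f in expected_fields)

def accuracy(prediction: dict, ground_truth: dict) -> float:
    """Fraction of ground-truth data points reproduced exactly by the model."""
    hits = sum(prediction.get(k) == v for k, v in ground_truth.items())
    return hits / len(ground_truth)
```

In this framing, the "proportion of complete machine-readable outputs" is simply the share of cases for which `is_complete_output` returns True, and overall accuracy is `accuracy` averaged over all cases.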

Results:

A total of 6,264 documents were collected (median 33 per patient). Ground truth was established with 92.6% initial inter-rater agreement before consensus. Several models produced complete JSON outputs for 100% of cases (Claude 4.1 Opus, Grok 4, Gemini 2.5 Flash); GPT-4.1 (DeepSearch) and GPT-5 completed 99.4%. The highest accuracy was achieved by GPT-4.1 (87.6%), followed by GPT-4.5 (85.6%), Claude 4.1 (84.8%), and Grok 4 (84.2%). Accuracy varied by data type: binary (up to 95%), numeric (~89%), short text (~78%), and multiple-choice (~75%).

Conclusions:

Currently available LLMs can reliably extract structured clinical information from full, non-English EHRs, while older or smaller models show major limitations. A hybrid workflow—automated extraction with targeted validation—appears practical for research use.


 Citation

Please cite as:

Černý M, Májovský M, Lama S, Edward G, Hallak H, Singh R, Riva-Cambrin H, Šustek P, Široká L, Tullett I, Horčičáková K, Hužava S, Sajfrídová K, El-Haj Ahmadová K, Sutherland GR, Netuka D

A Comparative Benchmark of 19 Large Language Models for Structured Data Extraction from Neurosurgical Clinical Records

JMIR Preprints. 04/02/2026:92871

DOI: 10.2196/preprints.92871

URL: https://preprints.jmir.org/preprint/92871


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.