
Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Jul 30, 2025
Date Accepted: Mar 6, 2026

The final, peer-reviewed published version of this preprint can be found here:

Large Language Model Automated Extraction of Clinical Signs and Symptoms From Emergency Department Reports for Machine Learning Prediction Models: Development and Validation Study

Schipper A, Belgers P, O'Connor R, van de Wouw L, Builtjes L, Bosma JS, Kusters R, Kurstjens S, Rutten M, van Ginneken B

JMIR Med Inform 2026;14:e81500

DOI: 10.2196/81500

Large Language Model Automated Extraction of Clinical Signs and Symptoms From Emergency Department Reports for Machine Learning Prediction Models: A Development and Validation Study

  • Anoeska Schipper; 
  • Peter Belgers; 
  • Rory O'Connor; 
  • Lieke van de Wouw; 
  • Luc Builtjes; 
  • Joeran Sander Bosma; 
  • Ron Kusters; 
  • Steef Kurstjens; 
  • Matthieu Rutten; 
  • Bram van Ginneken

ABSTRACT

Background:

Most clinically relevant information from emergency department (ED) visits is documented in free text, hindering its secondary use for research or decision support. Large language model (LLM)-based feature extraction has been studied in radiology and pathology but not yet in ED settings. Privacy-preserving local LLMs could enable automated feature extraction for decision support without increasing physician workload.

Objective:

To evaluate whether a small open-source LLM (Qwen2.5:14B) can automatically extract sixteen clinical signs and symptoms from ED reports, and whether these extracted features can serve as input for an appendicitis prediction model. LLM performance under minimal and optimized zero-shot prompts was assessed against researcher annotations (reference standard) and physician annotations.

Methods:

This retrospective study used 336 ED reports from patients presenting with acute abdominal pain to a Dutch teaching hospital (2016-2023). One hundred reports were randomly selected to develop a minimal and an optimized zero-shot prompt strategy. The remaining 236 reports, reserved for validation, were annotated by two ED physicians and processed by the LLM to extract sixteen signs and symptoms, covering binary, multi-class, and multi-label classification tasks. These features were used as input to the HIVE (History, Intake, Vitals, Examination) appendicitis prediction model. LLM extraction accuracy, sensitivity, and specificity were measured against researcher (reference standard) and physician annotations. The HIVE model's area under the receiver operating characteristic curve (AUROC) was evaluated using LLM-extracted vs physician-annotated features.
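The extraction step described above — prompting an LLM to return structured answers for binary, multi-class, and multi-label features — can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the feature names, prompt wording, and schema below are hypothetical stand-ins (the paper's actual sixteen features are not listed in the abstract), and the LLM call itself is omitted so the sketch focuses on schema-constrained parsing of the model's JSON reply.

```python
import json

# Hypothetical schema showing one feature of each task type the abstract
# mentions; the study's real sixteen features are not reproduced here.
FEATURES = {
    "rebound_tenderness": {"type": "binary"},  # yes/no finding
    "pain_location": {"type": "multi-class",
                      "classes": ["rlq", "llq", "epigastric", "diffuse"]},
    "associated_symptoms": {"type": "multi-label",
                            "labels": ["nausea", "vomiting", "fever"]},
}

# A minimal zero-shot prompt just requests JSON; an "optimized" prompt
# would additionally spell out definitions and allowed answer values.
MINIMAL_PROMPT = (
    "Extract the following findings from the ED report below and answer "
    "in JSON with keys {keys}.\n\nReport:\n{report}"
)

def parse_llm_response(raw: str) -> dict:
    """Validate the model's JSON answer against the feature schema."""
    answer = json.loads(raw)
    parsed = {}
    for name, spec in FEATURES.items():
        value = answer.get(name)
        if spec["type"] == "binary":
            # Map yes-like answers to 1, everything else to 0.
            parsed[name] = 1 if value in (True, "yes", 1) else 0
        elif spec["type"] == "multi-class":
            # Reject classes outside the allowed set.
            parsed[name] = value if value in spec["classes"] else None
        else:
            # Multi-label: keep only labels the schema knows about.
            parsed[name] = sorted(set(value or []) & set(spec["labels"]))
    return parsed
```

Validating against a fixed schema like this is one way to keep free-form LLM output usable as tabular model input: hallucinated classes or labels are dropped rather than propagated into the downstream prediction model.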

Results:

Among 336 ED reports from patients with acute abdominal pain (median age, 41 years [IQR, 22-62 years]; 61% female), 50% had appendicitis. The LLM achieved weighted average accuracies of 0.910 (95% CI, ±0.018) with minimal prompts and 0.929 (95% CI, ±0.016) with optimized prompts, versus 0.961 (95% CI, ±0.012) and 0.951 (95% CI, ±0.015) for physicians. Corresponding HIVE model AUROCs were 0.871 (95% CI, ±0.019) and 0.911 (95% CI, ±0.014) with LLM inputs under the minimal and optimized prompts, compared to 0.917 (95% CI, ±0.015) and 0.924 (95% CI, ±0.018) for physician inputs.
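The AUROC figures above compare the prediction model's discrimination when fed LLM-extracted versus physician-annotated features. For reference, AUROC has a simple rank-based definition (the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative one); the minimal sketch below computes it that way. This is a generic illustration of the metric, not the study's evaluation code.

```python
def auroc(labels, scores):
    """AUROC via the rank-sum (Mann-Whitney U) formulation:
    the fraction of positive/negative pairs where the positive
    case scores higher, counting ties as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Example: one misordered pair out of four gives AUROC = 0.75.
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

In practice a library implementation (e.g. scikit-learn's `roc_auc_score`) with bootstrap resampling would be used to obtain confidence intervals like those reported above.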

Conclusions:

Small, locally deployable LLMs can approach physician-level accuracy in extracting structured clinical data from free-text ED reports, while preserving patient privacy, interpretability, and statistical transparency for downstream diagnostic modeling.




© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.