Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Jul 30, 2025
Date Accepted: Mar 6, 2026
Large Language Model Automated Extraction of Clinical Signs and Symptoms From Emergency Department Reports for Machine Learning Prediction Models: A Development and Validation Study
ABSTRACT
Background:
Most clinically relevant information in emergency department (ED) visits is documented in free text, hindering secondary use for research or decision support. Large Language Model (LLM)-based feature extraction has been studied in radiology and pathology but has not yet been investigated in ED settings. Privacy-preserving local LLMs could enable automated feature extraction for decision support without increasing physician workload.
Objective:
To evaluate whether a small open-source LLM (Qwen2.5:14B) can automatically extract sixteen clinical signs and symptoms from ED reports and use these as input for an appendicitis prediction model. LLM performance under minimal and optimized zero-shot prompts was assessed against researcher annotations (reference standard) and physician annotations.
Methods:
This retrospective study used 336 ED reports from patients presenting with acute abdominal pain to a Dutch teaching hospital (2016-2023). One hundred reports were randomly selected to develop a minimal and an optimized zero-shot prompt strategy. The remaining 236 reports, reserved for validation, were annotated by two ED physicians and processed by the LLM to extract sixteen signs and symptoms, covering binary, multi-class, and multi-label classification tasks. These features were used as input to the HIVE (History, Intake, Vitals, Examination) appendicitis prediction model. LLM extraction accuracy, sensitivity, and specificity were measured against researcher (reference standard) and physician annotations. The HIVE model’s area under the receiver operating characteristic curve (AUROC) was evaluated using LLM-extracted vs physician-annotated features.
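The extraction step described above (zero-shot prompting of a local LLM to turn a free-text report into structured features) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the feature names, label set, and prompt wording are hypothetical stand-ins for the study's sixteen signs and symptoms, and the model call itself (e.g., to a locally hosted Qwen2.5:14B) is left to the reader's serving setup.

```python
import json

# Hypothetical subset of the extracted features; the study used sixteen.
SYMPTOMS = [
    "migration_of_pain", "anorexia", "nausea_or_vomiting",
    "right_lower_quadrant_tenderness", "rebound_tenderness", "fever",
]

def build_prompt(report_text: str) -> str:
    """Assemble a zero-shot prompt asking the model to return a single
    JSON object with one label per feature (illustrative wording only)."""
    schema = ", ".join(f'"{s}": "present|absent|not_mentioned"' for s in SYMPTOMS)
    return (
        "Extract the following signs and symptoms from the emergency "
        "department report below. Answer ONLY with a JSON object of the "
        f"form {{{schema}}}.\n\nReport:\n{report_text}"
    )

def parse_features(model_output: str) -> dict:
    """Map the model's JSON answer to model-ready inputs; malformed JSON
    or unexpected labels fall back to 'not_mentioned'."""
    try:
        raw = json.loads(model_output)
    except json.JSONDecodeError:
        raw = {}
    allowed = {"present", "absent", "not_mentioned"}
    return {
        s: (raw.get(s) if raw.get(s) in allowed else "not_mentioned")
        for s in SYMPTOMS
    }
```

The resulting dictionary would then be encoded as input features for a downstream prediction model such as HIVE; the prompt string is what would be sent to the local model.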
Results:
Among 336 ED reports from patients with acute abdominal pain (median age, 41 years [IQR, 22-62 years]; 61% female), 50% had appendicitis. The LLM achieved weighted average accuracies of 0.910 (95% CI, ±0.018) with minimal prompts and 0.929 (95% CI, ±0.016) with optimized prompts, versus 0.961 (95% CI, ±0.012) and 0.951 (95% CI, ±0.015) for physicians. Corresponding HIVE model AUROCs were 0.871 (95% CI, ±0.019) and 0.911 (95% CI, ±0.014) with LLM inputs under the minimal and optimized prompts, compared to 0.917 (95% CI, ±0.015) and 0.924 (95% CI, ±0.018) for physician inputs.
Conclusions:
Small, locally deployable LLMs can approach physician-level accuracy in extracting structured clinical data from free-text ED reports, while preserving patient privacy, interpretability, and statistical transparency for downstream diagnostic modeling.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.