Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Dec 30, 2025
Date Accepted: Jun 9, 2026
Evaluation of Large Language Models for Structured Data Extraction from Interstitial Lung Disease Clinical Notes: A Comparative Study
ABSTRACT
Background:
Most clinically relevant data is contained in unstructured text within clinical notes. Clinical notes are prone to verbosity and imprecision, which makes structured data extraction a major bottleneck and a costly endeavor when screening patients for studies or when creating and maintaining healthcare registries and databases.
Objective:
We aimed to compare the performance of various large language models (LLMs) for structured data extraction from unstructured interstitial lung disease (ILD) clinic notes. The primary analysis evaluated LLM extraction of binary structured data from clinical notes. A secondary analysis evaluated select LLMs for extraction of multi-class data.
Methods:
We used 12 different LLMs to extract binary answers to 10 ILD clinical questions from clinic notes for 100 ILD clinic patients. We additionally used 2 LLMs to extract multi-class data regarding ILD classification. Ground truth was established by consensus among three ILD physicians. LLM performance was evaluated using accuracy, precision, recall, and F1 scores.
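As an illustration of the evaluation metrics named above, the following is a minimal sketch of how accuracy, precision, recall, and F1 could be computed for one binary question by comparing LLM answers against the physician consensus. The function name `binary_metrics` and the example labels are hypothetical and are not taken from the study; the convention of returning 0.0 when a denominator is zero is also an assumption.

```python
from typing import Sequence


def binary_metrics(truth: Sequence[bool], predicted: Sequence[bool]) -> dict[str, float]:
    """Compare binary predictions with ground-truth labels (hypothetical helper)."""
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")

    # Count the four cells of the confusion matrix.
    tp = sum(1 for t, p in zip(truth, predicted) if t and p)
    tn = sum(1 for t, p in zip(truth, predicted) if not t and not p)
    fp = sum(1 for t, p in zip(truth, predicted) if not t and p)
    fn = sum(1 for t, p in zip(truth, predicted) if t and not p)

    total = tp + tn + fp + fn
    # Assumption: a metric with a zero denominator is reported as 0.0.
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0

    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels for five notes on a single yes/no question (not study data).
consensus = [True, True, False, False, True]
llm_answers = [True, False, False, True, True]
metrics = binary_metrics(consensus, llm_answers)
```

In a setup like the one described, such a function would be applied once per model and question, with the results then aggregated across the 10 questions.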
Results:
LLMs processed each clinical note-prompt combination in 1-2 seconds, at an estimated cost of less than $0.02 for each note-prompt combination. Of the 12 LLMs assessed, Claude 3.5 Sonnet (Anthropic, San Francisco), GPT-4, GPT-4o-mini, GPT-4o, o1, o1-mini, o3-mini, gpt-oss-20b, and gpt-oss-120b (OpenAI, San Francisco) consistently achieved high accuracy, similar to that of the three ILD clinicians (96.2%). Multi-class data extraction demonstrated lower accuracy than binary data extraction.
Conclusions:
Multiple LLMs consistently achieved human-level accuracy in extracting structured binary data from ILD clinical notes, while being orders of magnitude faster and cheaper. LLMs are promising tools for clinical data extraction that can improve clinical research efficiency.
Clinical Trial: None
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.