Accepted for/Published in: JMIR Bioinformatics and Biotechnology
Date Submitted: Dec 30, 2024
Open Peer Review Period: Jan 13, 2025 - Mar 10, 2025
Date Accepted: Oct 21, 2025
(closed for review but you can still tweet)
Development and Validation of a Generative Artificial Intelligence-Based Pipeline for Automated Clinical Data Extraction from Electronic Health Records: Technical Implementation Study
ABSTRACT
Background:
Manual abstraction of unstructured clinical data is often necessary for granular clinical outcomes research but is time consuming and can be of variable quality. Large language models (LLMs) show promise in medical data extraction yet integrating them into research workflows remains challenging and poorly described.
Objective:
To develop and integrate an LLM-based system for automated data extraction from unstructured electronic health record (EHR) text reports within an established clinical outcomes database.
Methods:
We implemented a generative AI pipeline (UODBLLM) utilizing a flexible language model interface that supports various LLM implementations, including HIPAA-compliant cloud services and local open-source models. We used XML-structured prompts and integrated using an open database connectivity interface to generate structured data from clinical documentation in the EHR. We evaluated UODBLLM's performance on completion rate, processing time, and extraction capabilities across multiple clinical data elements, including quantitative measurements, categorical assessments, and anatomical descriptions. System reliability was tested across multiple batches to assess scalability and consistency.
Results:
Piloted against MRI reports, UODBLLM processed 1,800 clinical documents with a 100% completion rate and an average processing time of 8.90 seconds per report. Token utilization averaged 2,692 tokens per report, with an input-to-output ratio of approximately 6.5:1, resulting in a processing cost of $0.009 per report. UODBLLM had consistent performance across 18 batches of 100 reports each and completed all processing in 4.45 hours. From each report, UODBLLM extracted 16 structured clinical elements, including prostate volume, PSA values, PI-RADS scores, clinical staging, and anatomical assessments. All extracted data was automatically validated against predefined schemas and stored in standardized JSON format.
Conclusions:
We demonstrated successful integration of an LLM-based extraction system within an existing clinical outcomes database, achieving rapid, comprehensive data extraction at minimal cost. UODBLLM provides a scalable, efficient solution for automating clinical data extraction while maintaining PHI security. This approach could significantly accelerate research timelines and expand feasible clinical studies, particularly for large-scale database projects.
Citation
Request queued. Please wait while the file is being generated. It may take some time.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on it's website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressively prohibit redistribution of this draft paper other than for review purposes.