Accepted for/Published in: JMIR Bioinformatics and Biotechnology

Date Submitted: Dec 30, 2024
Open Peer Review Period: Jan 13, 2025 - Mar 10, 2025
Date Accepted: Oct 21, 2025

The final, peer-reviewed published version of this preprint can be found here:

Carlisle MN, Pace WA, Liu AW, Krumm R, Cowan JE, Carroll PR, Cooperberg MR, Odisho AY

Development and Validation of a Generative Artificial Intelligence-Based Pipeline for Automated Clinical Data Extraction From Electronic Health Records: Technical Implementation Study

JMIR Bioinform Biotech 2026;7:e70708

DOI: 10.2196/70708

PMID: 41494167

PMCID: 12774397

Development and Validation of a Generative Artificial Intelligence-Based Pipeline for Automated Clinical Data Extraction from Electronic Health Records: Technical Implementation Study

  • Marvin N. Carlisle
  • William A. Pace
  • Andrew W. Liu
  • Robert Krumm
  • Janet E. Cowan
  • Peter R. Carroll
  • Matthew R. Cooperberg
  • Anobel Y. Odisho

ABSTRACT

Background:

Manual abstraction of unstructured clinical data is often necessary for granular clinical outcomes research, but it is time-consuming and can be of variable quality. Large language models (LLMs) show promise in medical data extraction, yet integrating them into research workflows remains challenging and poorly described.

Objective:

To develop and integrate an LLM-based system for automated data extraction from unstructured electronic health record (EHR) text reports within an established clinical outcomes database.

Methods:

We implemented a generative AI pipeline (UODBLLM) built on a flexible language model interface that supports various LLM implementations, including HIPAA-compliant cloud services and local open-source models. We used XML-structured prompts and connected the pipeline to the database through an Open Database Connectivity (ODBC) interface to generate structured data from clinical documentation in the EHR. We evaluated UODBLLM's performance on completion rate, processing time, and extraction capabilities across multiple clinical data elements, including quantitative measurements, categorical assessments, and anatomical descriptions. System reliability was tested across multiple batches to assess scalability and consistency.
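To make the Methods concrete, the following is a minimal sketch of how an XML-structured prompt and a swappable model interface could fit together. It is not the UODBLLM implementation: the `LanguageModel` protocol, the `StubModel` backend, the XML tag names, and the two-field `SCHEMA` are all hypothetical names chosen for illustration, and the ODBC read and write steps are omitted.

```python
import json
import xml.etree.ElementTree as ET
from typing import Protocol


class LanguageModel(Protocol):
    """Hypothetical interface: any backend (cloud or local) that maps a prompt to text."""

    def complete(self, prompt: str) -> str: ...


class StubModel:
    """Stand-in backend returning a fixed JSON string; a real backend would call an LLM."""

    def complete(self, prompt: str) -> str:
        return json.dumps({"prostate_volume_ml": 42.0, "pirads_score": 4})


# Hypothetical schema for illustration: field name -> accepted Python types.
SCHEMA = {"prostate_volume_ml": (int, float), "pirads_score": (int,)}


def build_prompt(report_text: str, fields: list[str]) -> str:
    """Wrap the instructions, requested fields, and report text in XML tags."""
    root = ET.Element("prompt")
    ET.SubElement(root, "instructions").text = (
        "Extract the listed fields from the report and reply with JSON only."
    )
    fields_el = ET.SubElement(root, "fields")
    for name in fields:
        ET.SubElement(fields_el, "field").text = name
    # Special characters in the report are escaped when the tree is serialized.
    ET.SubElement(root, "report").text = report_text
    return ET.tostring(root, encoding="unicode")


def validate(record: dict) -> dict:
    """Reject records with missing fields or values of an unexpected type."""
    for name, types in SCHEMA.items():
        if name not in record:
            raise ValueError(f"missing field: {name}")
        value = record[name]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, types):
            raise ValueError(f"unexpected type for field: {name}")
    return record


def extract(model: LanguageModel, report_text: str) -> dict:
    """Build the prompt, call the backend, then parse and validate its JSON reply."""
    prompt = build_prompt(report_text, list(SCHEMA))
    return validate(json.loads(model.complete(prompt)))
```

Because `extract` depends only on the `complete` method, a cloud-hosted or locally run model could be substituted for `StubModel` without changing the prompt or validation code, which is one plausible reading of the "flexible language model interface" described above.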

Results:

In a pilot on MRI reports, UODBLLM processed 1,800 clinical documents with a 100% completion rate and an average processing time of 8.90 seconds per report. Token utilization averaged 2,692 tokens per report, with an input-to-output ratio of approximately 6.5:1, resulting in a processing cost of $0.009 per report. UODBLLM performed consistently across 18 batches of 100 reports each and completed all processing in 4.45 hours. From each report, UODBLLM extracted 16 structured clinical elements, including prostate volume, prostate-specific antigen (PSA) values, Prostate Imaging Reporting and Data System (PI-RADS) scores, clinical staging, and anatomical assessments. All extracted data were automatically validated against predefined schemas and stored in standardized JSON format.
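The reported figures can be cross-checked with simple arithmetic. The short calculation below uses only the numbers stated in the Results; the per-report input and output token counts and the total cost are derived here from rounded values and are not reported in the abstract, so they should be read as approximations.

```python
# Figures taken from the Results paragraph.
REPORTS = 1_800
SECONDS_PER_REPORT = 8.90
TOKENS_PER_REPORT = 2_692
INPUT_TO_OUTPUT_RATIO = 6.5  # approximate, as stated
COST_PER_REPORT_USD = 0.009

# Total wall-clock time if reports are processed one after another.
total_hours = REPORTS * SECONDS_PER_REPORT / 3600

# Split the average token count using the stated input:output ratio.
output_tokens = TOKENS_PER_REPORT / (1 + INPUT_TO_OUTPUT_RATIO)
input_tokens = TOKENS_PER_REPORT - output_tokens

# Derived total cost (not stated in the abstract; based on the rounded per-report cost).
total_cost_usd = REPORTS * COST_PER_REPORT_USD

print(f"total time: {total_hours:.2f} h")
print(f"input tokens per report: {input_tokens:.0f}")
print(f"output tokens per report: {output_tokens:.0f}")
print(f"derived total cost: ${total_cost_usd:.2f}")
```

The product of 1,800 reports and 8.90 seconds per report is 16,020 seconds, which matches the reported 4.45 hours and is consistent with reports being processed sequentially.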

Conclusions:

We demonstrated successful integration of an LLM-based extraction system within an existing clinical outcomes database, achieving rapid, comprehensive data extraction at minimal cost. UODBLLM provides a scalable, efficient solution for automating clinical data extraction while maintaining the security of protected health information (PHI). This approach could significantly accelerate research timelines and expand the range of feasible clinical studies, particularly for large-scale database projects.


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.