JMIR Preprints #70708: Development and Validation of a Generative Artificial Intelligence-Based Pipeline for Automated Clinical Data Extraction from Electronic Health Records: Technical Implementation Study

Development and Validation of a Generative Artificial Intelligence-Based Pipeline for Automated Clinical Data Extraction from Electronic Health Records: Technical Implementation Study

Marvin N. Carlisle;
William A. Pace;
Andrew W. Liu;
Robert Krumm;
Janet E. Cowan;
Peter R. Carroll;
Matthew R. Cooperberg;
Anobel Y. Odisho

ABSTRACT

Background:

Manual abstraction of unstructured clinical data is often necessary for granular clinical outcomes research, but it is time-consuming and can be of variable quality. Large language models (LLMs) show promise in medical data extraction, yet integrating them into research workflows remains challenging and poorly described.

Objective:

To develop and integrate an LLM-based system for automated data extraction from unstructured electronic health record (EHR) text reports within an established clinical outcomes database.

Methods:

We implemented a generative AI pipeline (UODBLLM) built on a flexible language model interface that supports various LLM implementations, including HIPAA-compliant cloud services and local open-source models. We used XML-structured prompts and connected the pipeline to the database through an Open Database Connectivity (ODBC) interface to generate structured data from clinical documentation in the EHR. We evaluated UODBLLM's performance on completion rate, processing time, and extraction capabilities across multiple clinical data elements, including quantitative measurements, categorical assessments, and anatomical descriptions. System reliability was tested across multiple batches to assess scalability and consistency.
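
As an illustration of the XML-structured prompting and schema checking described above, the following is a minimal Python sketch and not the authors' implementation. The field names, the tag names (`prompt`, `instructions`, `fields`, `report`), and the helper functions `build_prompt` and `validate` are all hypothetical; the abstract does not give the actual prompt or schema. The LLM call and the ODBC database access are omitted because they depend on external services and drivers.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical subset of the extracted elements; names and types are assumptions.
SCHEMA = {
    "prostate_volume_ml": (int, float),
    "psa_ng_ml": (int, float),
    "pirads_score": (int,),
}


def build_prompt(report_text: str) -> str:
    """Wrap instructions, the field list, and the report text in XML tags."""
    root = ET.Element("prompt")
    ET.SubElement(root, "instructions").text = (
        "Extract the listed fields from the report and reply with JSON only. "
        "Use null for any field the report does not state."
    )
    fields = ET.SubElement(root, "fields")
    for name in SCHEMA:
        ET.SubElement(fields, "field", name=name)
    # ElementTree escapes characters such as < and & in the report text.
    ET.SubElement(root, "report").text = report_text
    return ET.tostring(root, encoding="unicode")


def validate(raw_json: str) -> dict:
    """Parse model output and check its keys and value types against SCHEMA."""
    data = json.loads(raw_json)
    if set(data) != set(SCHEMA):
        raise ValueError(f"unexpected keys: {sorted(set(data) ^ set(SCHEMA))}")
    for name, types in SCHEMA.items():
        value = data[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if value is not None and (isinstance(value, bool) or not isinstance(value, types)):
            raise ValueError(f"{name} has wrong type: {value!r}")
    return data
```

In a sketch like this, the string returned by `build_prompt` would be sent to whichever model the interface is configured for, and only output that passes `validate` would be written back to the database.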

Results:

Piloted against MRI reports, UODBLLM processed 1,800 clinical documents with a 100% completion rate and an average processing time of 8.90 seconds per report. Token utilization averaged 2,692 tokens per report, with an input-to-output ratio of approximately 6.5:1, resulting in a processing cost of $0.009 per report. UODBLLM had consistent performance across 18 batches of 100 reports each and completed all processing in 4.45 hours. From each report, UODBLLM extracted 16 structured clinical elements, including prostate volume, PSA values, PI-RADS scores, clinical staging, and anatomical assessments. All extracted data was automatically validated against predefined schemas and stored in standardized JSON format.
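
The reported figures can be cross-checked with simple arithmetic. The sketch below uses only the numbers stated above; the split into input and output tokens and the total cost are derived values, not figures reported by the authors, and they assume the 6.5:1 ratio applies to the per-report average.

```python
# Figures reported in the abstract.
REPORTS = 1_800
SECONDS_PER_REPORT = 8.90
TOKENS_PER_REPORT = 2_692
INPUT_OUTPUT_RATIO = 6.5  # input tokens per output token (approximate)
COST_PER_REPORT_USD = 0.009

# Total wall-clock time, assuming reports are processed one after another.
total_hours = REPORTS * SECONDS_PER_REPORT / 3600

# Derived split of the average token count into input and output tokens.
input_tokens = TOKENS_PER_REPORT * INPUT_OUTPUT_RATIO / (INPUT_OUTPUT_RATIO + 1)
output_tokens = TOKENS_PER_REPORT - input_tokens

# Derived totals for the whole pilot.
total_tokens = REPORTS * TOKENS_PER_REPORT
total_cost_usd = REPORTS * COST_PER_REPORT_USD

print(f"total time:    {total_hours:.2f} h")
print(f"input tokens:  {input_tokens:.0f} per report")
print(f"output tokens: {output_tokens:.0f} per report")
print(f"total tokens:  {total_tokens:,}")
print(f"total cost:    ${total_cost_usd:.2f}")
```

The computed total time matches the reported 4.45 hours. Under the stated assumption, the ratio implies roughly 2,333 input and 359 output tokens per report, and the per-report cost implies about $16.20 for all 1,800 reports.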

Conclusions:

We demonstrated successful integration of an LLM-based extraction system within an existing clinical outcomes database, achieving rapid, comprehensive data extraction at minimal cost. UODBLLM provides a scalable, efficient solution for automating clinical data extraction while maintaining PHI security. This approach could significantly accelerate research timelines and expand feasible clinical studies, particularly for large-scale database projects.

Citation

Please cite as:

Carlisle MN, Pace WA, Liu AW, Krumm R, Cowan JE, Carroll PR, Cooperberg MR, Odisho AY

Development and Validation of a Generative Artificial Intelligence-Based Pipeline for Automated Clinical Data Extraction From Electronic Health Records: Technical Implementation Study

JMIR Bioinform Biotech 2026;7:e70708

DOI: 10.2196/70708

PMID: 41494167

PMCID: 12774397

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Accepted for/Published in: JMIR Bioinformatics and Biotechnology

Date Submitted: Dec 30, 2024

Open Peer Review Period: Jan 13, 2025 - Mar 10, 2025

Date Accepted: Oct 21, 2025
