Accepted for/Published in: JMIR AI

Date Submitted: Nov 14, 2024
Date Accepted: Apr 27, 2025

The final, peer-reviewed published version of this preprint can be found here:

Leveraging Large Language Models for Accurate Retrieval of Patient Information From Medical Reports: Systematic Evaluation Study

Garcia-Carmona AM, Prieto ML, Puertas E, Beunza JJ

JMIR AI 2025;4:e68776

DOI: 10.2196/68776

PMID: 40608403

PMCID: 12271962

Enhanced medical data extraction: leveraging LLMs for accurate retrieval of patient information from medical reports

  • Angel Manuel Garcia-Carmona; 
  • Maria-Lorena Prieto; 
  • Enrique Puertas; 
  • Juan-Jose Beunza

ABSTRACT

Background:

The digital transformation of healthcare has introduced both opportunities and challenges, particularly in managing and analyzing the vast amounts of unstructured medical data generated daily. There is a need to explore the feasibility of generative solutions in extracting data from medical reports, categorized by specific criteria.

Objective:

This study investigates the application of Large Language Models (LLMs) for the automated extraction of structured information from unstructured medical reports, employing the LangChain framework in Python.

Methods:

The study systematically evaluates six leading LLMs (GPT-4o, LLaMA 3, LLaMA 3.1, Gemma 2, Qwen 2, and Qwen 2.5) using zero-shot prompting, embeds the results in a vector database, and assesses each model's performance in extracting patient demographics, diagnostic details, and pharmacological data.
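As a rough illustration of zero-shot extraction, the sketch below builds an instruction-only prompt (no worked examples) and parses a JSON reply into fixed fields. It is not the authors' LangChain pipeline: the field names in FIELDS, the prompt wording, and the stand-in fake_llm function are all assumptions made for this example, and any real model client would have to be supplied in place of fake_llm.

```python
import json
from typing import Any, Callable, Dict

# Hypothetical field names: the abstract names only the broad categories
# (demographics, diagnostic details, pharmacological data).
FIELDS = ("name", "age", "diagnosis", "medication")


def build_zero_shot_prompt(report: str) -> str:
    """Build a prompt that contains instructions only, with no examples."""
    field_list = ", ".join(FIELDS)
    return (
        "Extract the following fields from the medical report below and "
        f"answer with a single JSON object with the keys: {field_list}. "
        "Use null for any field the report does not state.\n\n"
        f"Report:\n{report}"
    )


def extract(report: str, llm: Callable[[str], str]) -> Dict[str, Any]:
    """Send the prompt to `llm` and return one value (or None) per field."""
    raw = llm(build_zero_shot_prompt(report))
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        data = {}  # unparseable reply: treat every field as missing
    if not isinstance(data, dict):
        data = {}
    return {field: data.get(field) for field in FIELDS}


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call; returns a canned, invented reply."""
    return (
        '{"name": "Jane Doe", "age": "54", '
        '"diagnosis": "type 2 diabetes", "medication": "metformin"}'
    )


result = extract("Jane Doe, 54, presents with type 2 diabetes.", fake_llm)
```

Passing the model as a plain callable keeps the sketch independent of any particular framework, so the same function could wrap a hosted or a local model.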

Results:

Evaluation metrics, including accuracy, precision, recall, and F1 scores, revealed high efficacy across most categories, with GPT-4o achieving the highest overall performance (91.4% accuracy).
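The four metrics named above have standard definitions in terms of true and false positives and negatives. The sketch below computes them from paired boolean labels; how the study mapped extracted fields to positives and negatives is not stated in the abstract, so the labelling here, and the toy labels themselves, are assumptions for illustration only.

```python
from typing import Dict, Sequence


def classification_metrics(
    y_true: Sequence[bool], y_pred: Sequence[bool]
) -> Dict[str, float]:
    """Compute accuracy, precision, recall, and F1 from paired labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one label pair is required")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)          # true positives
    tn = sum(1 for t, p in pairs if not t and not p)  # true negatives
    fp = sum(1 for t, p in pairs if not t and p)      # false positives
    fn = sum(1 for t, p in pairs if t and not p)      # false negatives

    accuracy = (tp + tn) / len(pairs)
    # Report 0.0 when a denominator is zero instead of raising.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Toy labels, invented for this example (not data from the study).
gold = [True, True, True, False, False]
predicted = [True, True, False, True, False]
metrics = classification_metrics(gold, predicted)
```

With these toy labels there are two true positives, one true negative, one false positive, and one false negative, so accuracy is 3/5 and precision, recall, and F1 are each 2/3.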

Conclusions:

The findings highlight notable differences in precision and recall between models, particularly in extracting names and age-related information. Challenges in processing unstructured medical text, including variability in model performance across data types, are discussed. The study demonstrates the feasibility of integrating LLMs into healthcare workflows, offering significant improvements in data accessibility and supporting clinical decision-making processes. Additionally, it explores the role of retrieval-augmented generation (RAG) techniques in enhancing information retrieval accuracy, addressing issues such as hallucinations and outdated data in LLM outputs. Future work should focus on optimization through larger and more diverse training datasets, advanced prompting strategies, and the integration of domain-specific knowledge to improve model generalizability and precision.


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.