Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Feb 13, 2025
Date Accepted: Apr 22, 2025

The final, peer-reviewed published version of this preprint can be found here:

Enhancing Pulmonary Disease Prediction Using Large Language Models With Feature Summarization and Hybrid Retrieval-Augmented Generation: Multicenter Methodological Study Based on Radiology Report

Li R, Mao S, Zhu C, Yang Y, Tan C, Li L, Mu X, Liu H, Yang Y

J Med Internet Res 2025;27:e72638

DOI: 10.2196/72638

PMID: 40499132

PMCID: 12176309

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Enhancing Pulmonary Disease Prediction Using Large Language Models with Feature Summarization and Hybrid Retrieval-Augmented Generation

  • Ronghao Li
  • Shuai Mao
  • Congmin Zhu
  • Yingliang Yang
  • Chunting Tan
  • Li Li
  • Xiangdong Mu
  • Honglei Liu
  • Yuqing Yang

ABSTRACT

Background:

The rapid advancements in natural language processing (NLP), particularly the development of large language models (LLMs), have opened new avenues for managing complex clinical text data. However, the inherent complexity and specificity of medical texts present significant challenges for the practical application of prompt engineering in diagnostic tasks.

Objective:

To address these limitations, this study proposes a novel prompt engineering strategy that integrates feature summarization, chain of thought (CoT) reasoning, and a hybrid retrieval-augmented generation (RAG) framework.

Methods:

A feature summarization approach, leveraging TF-IDF and K-means clustering, was employed to extract and distill key radiological findings. Simultaneously, the hybrid RAG framework combined dense and sparse vector representations to enhance LLMs’ comprehension of disease-related text. The proposed strategy was evaluated using a multicenter dataset containing radiology reports on pneumonia, tuberculosis, and lung cancer, with three state-of-the-art LLMs: GLM-4-plus, GLM-4-air, and GPT-4o.
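
The two building blocks named above can be illustrated with a minimal, standard-library sketch. It is not the authors' implementation: the abstract does not specify tokenization, the number of clusters, the embedding model, or how dense and sparse scores are fused. Whitespace tokenization, first-k centroid seeding, the toy two-dimensional "dense" vectors, the weighted-sum fusion with `alpha`, and all function names below are assumptions made purely for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """Return one sparse TF-IDF vector (term -> weight) per text."""
    tokenized = [t.lower().split() for t in texts]
    doc_freq = Counter(term for toks in tokenized for term in set(toks))
    n = len(texts)
    # Smoothed IDF: terms found in every text still get a nonzero weight.
    idf = {term: math.log((1 + n) / (1 + df)) + 1.0 for term, df in doc_freq.items()}
    return [{term: cnt / len(toks) * idf[term] for term, cnt in Counter(toks).items()}
            for toks in tokenized]

def sq_dist(a, b):
    """Squared Euclidean distance between two sparse vectors."""
    return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in set(a) | set(b))

def cosine(a, b):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    norm = math.sqrt(sum(w * w for w in a.values()) * sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0

def mean_vector(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = {}
    for vec in vectors:
        for k, w in vec.items():
            total[k] = total.get(k, 0.0) + w
    return {k: w / len(vectors) for k, w in total.items()}

def kmeans(vectors, k, iterations=20):
    """Plain k-means, assuming k <= len(vectors). Seeding with the first k
    vectors is an assumption for reproducibility; k-means++ is more common."""
    centroids = [dict(v) for v in vectors[:k]]
    labels = []
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c])) for v in vectors]
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = mean_vector(members)
    return labels, centroids

def summarize(findings, k):
    """Keep, per cluster, the finding whose vector is closest to the centroid."""
    vectors = tfidf_vectors(findings)
    labels, centroids = kmeans(vectors, k)
    summary = []
    for c in range(k):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if members:
            best = min(members, key=lambda i: sq_dist(vectors[i], centroids[c]))
            summary.append(findings[best])
    return summary

def hybrid_rank(query, passages, query_dense, passage_dense, alpha=0.5):
    """Rank passages by alpha * dense cosine + (1 - alpha) * sparse cosine."""
    # Fit TF-IDF on passages plus query so they share one vocabulary and IDF.
    *passage_sparse, query_sparse = tfidf_vectors(passages + [query])
    as_sparse = lambda vec: dict(enumerate(vec))  # list -> index-keyed dict
    scored = []
    for text, sparse, dense in zip(passages, passage_sparse, passage_dense):
        score = (alpha * cosine(as_sparse(query_dense), as_sparse(dense))
                 + (1 - alpha) * cosine(query_sparse, sparse))
        scored.append((score, text))
    return [text for score, text in sorted(scored, reverse=True)]

# Made-up report sentences, for illustration only.
findings = [
    "patchy consolidation in right lower lobe",
    "spiculated nodule in left upper lobe",
    "patchy consolidation in left lower lobe",
    "spiculated nodule in right upper lobe",
]
print(summarize(findings, k=2))

passages = [
    "cavitary lesion with tree-in-bud opacities",
    "lobar consolidation with air bronchogram",
]
# Toy dense vectors standing in for the output of an embedding model.
ranked = hybrid_rank("consolidation with air bronchogram", passages,
                     query_dense=[0.1, 0.9], passage_dense=[[1.0, 0.0], [0.0, 1.0]])
print(ranked[0])
```

In this sketch the summarizer keeps one representative sentence per cluster, and `alpha` sets how much the ranking leans on dense versus sparse similarity. Both are tunable choices rather than values reported in the study.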

Results:

Comparative analyses were performed against a BERT-based prediction model and various other prompt engineering techniques. Our strategy achieved superior performance, attaining an accuracy of 0.8947 and an F1 score of 0.8887 on the primary dataset, alongside an accuracy of 0.9167 and an F1 score of 0.8631 on an external validation dataset of radiology reports.
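
For readers less familiar with the reported metrics, the following sketch shows how accuracy and an F1 score can be computed for a three-class task. The abstract does not state which F1 averaging scheme was used, so macro-averaging is an assumption here, and the labels are made-up toy values rather than study data.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the reference label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (macro-averaging, assumed)."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0.0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy labels for illustration only; not taken from the study.
y_true = ["pneumonia", "pneumonia", "tuberculosis",
          "lung cancer", "lung cancer", "tuberculosis"]
y_pred = ["pneumonia", "tuberculosis", "tuberculosis",
          "lung cancer", "lung cancer", "tuberculosis"]
print(round(accuracy(y_true, y_pred), 4), round(macro_f1(y_true, y_pred), 4))
```

Macro-averaging weights each disease class equally regardless of how many reports it has. A weighted average would instead follow the class distribution and could give a different value on imbalanced data.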

Conclusions:

These findings highlight the potential of LLMs to revolutionize pulmonary disease prediction, particularly in resource-constrained settings, by surpassing traditional models in both accuracy and flexibility. The proposed prompt engineering strategy not only improves predictive performance but also enhances the adaptability of LLMs in complex medical contexts, offering a promising tool for advancing disease diagnosis and clinical decision making.



© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.