
Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Nov 11, 2024
Open Peer Review Period: Nov 11, 2024 - Jan 6, 2025
Date Accepted: Jan 30, 2025

The final, peer-reviewed published version of this preprint can be found here:

Automated Radiology Report Labeling in Chest X-Ray Pathologies: Development and Evaluation of a Large Language Model Framework

Abdullah M, Kim ST

Automated Radiology Report Labeling in Chest X-Ray Pathologies: Development and Evaluation of a Large Language Model Framework

JMIR Med Inform 2025;13:e68618

DOI: 10.2196/68618

PMID: 40153539

PMCID: 11970564

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Large Language Models are good Radiology Report Labelers

  • Mr Abdullah
  • Seong Tae Kim

ABSTRACT

Background:

Labeling unstructured radiology reports is crucial for creating structured datasets that facilitate downstream tasks, such as training large-scale medical imaging models. Current approaches typically rely on BERT-based methods or manual expert annotations, which have limitations in terms of scalability and performance.

Objective:

To evaluate the effectiveness of a GPT-based large language model (LLM) in labeling radiology reports, comparing it with two existing methods, CheXbert and CheXpert, on a large chest X-ray dataset (MIMIC-CXR).

Methods:

In this study, we introduce an LLM-based approach fine-tuned on expert-labeled radiology reports. We evaluated the model on 687 radiologist-labeled chest X-ray reports, comparing F1 scores across 14 thoracic pathologies. Its performance was compared with that of the CheXbert and CheXpert models on positive, negative, and uncertainty extraction tasks. Paired t-tests and Wilcoxon signed-rank tests were performed to evaluate the statistical significance of differences between model performances.
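The paired comparison described above can be sketched as follows. This is a minimal illustration using SciPy, with placeholder per-pathology F1 values for 14 pathologies; the numbers are invented for demonstration and are not the paper's data.

```python
# Hypothetical sketch of the Methods' significance testing: compare two
# labelers' per-pathology F1 scores with a paired t-test and a Wilcoxon
# signed-rank test. F1 values below are illustrative placeholders only.
from scipy.stats import ttest_rel, wilcoxon

# Per-pathology F1 scores for 14 thoracic pathologies (hypothetical values)
f1_llm      = [0.92, 0.88, 0.95, 0.90, 0.87, 0.93, 0.91,
               0.89, 0.94, 0.86, 0.90, 0.92, 0.88, 0.91]
f1_chexpert = [0.89, 0.85, 0.93, 0.88, 0.84, 0.91, 0.88,
               0.86, 0.92, 0.83, 0.87, 0.90, 0.85, 0.88]

# Paired t-test: tests whether the mean of the paired F1 differences is zero
t_stat, t_p = ttest_rel(f1_llm, f1_chexpert)

# Wilcoxon signed-rank test: non-parametric check on the same paired differences
w_stat, w_p = wilcoxon(f1_llm, f1_chexpert)

print(f"paired t-test: t={t_stat:.3f}, P={t_p:.4f}")
print(f"Wilcoxon:      W={w_stat:.1f}, P={w_p:.4f}")
```

Both tests operate on the same 14 paired differences, which is why the paper can report agreement between the parametric and non-parametric results.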

Results:

Our GPT-based model achieved an average F1 score of 0.9014 for all certainty levels and 0.8708 for positive/negative certainty levels, outperforming CheXpert (F1 scores of 0.8864 and 0.8525, respectively) and performing comparably to CheXbert (F1 scores of 0.9047 and 0.8733, respectively). Paired t-tests revealed no statistically significant difference between our model and CheXbert (P = 0.3483), but a significant difference between our model and CheXpert (P = 0.0114). The Wilcoxon test also confirmed these results: no significant difference between our model and CheXbert (P = 0.1353) and a significant difference between our model and CheXpert (P = 0.0052).

Conclusions:

The GPT-based LLM demonstrates performance comparable to CheXbert and outperforms CheXpert in radiology report labeling. These findings suggest that LLMs are a promising alternative to traditional BERT-based architectures for this task, offering enhanced context understanding and eliminating the need for extensive feature engineering. Moreover, their larger context length makes LLM-based models better suited to this task than BERT-based models, whose context length is limited. Clinical Trial: Not applicable.



© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.