
Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Nov 11, 2024
Open Peer Review Period: Nov 11, 2024 - Jan 6, 2025
Date Accepted: Jan 30, 2025

The final, peer-reviewed published version of this preprint can be found here:

Automated Radiology Report Labeling in Chest X-Ray Pathologies: Development and Evaluation of a Large Language Model Framework

Abdullah M, Kim ST

Automated Radiology Report Labeling in Chest X-Ray Pathologies: Development and Evaluation of a Large Language Model Framework

JMIR Med Inform 2025;13:e68618

DOI: 10.2196/68618

PMID: 40153539

PMCID: 11970564

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Large Language Models are good Radiology Report Labelers

  • Mr Abdullah
  • Seong Tae Kim

ABSTRACT

Background:

Labeling unstructured radiology reports is crucial for creating structured datasets that facilitate downstream tasks, such as training large-scale medical imaging models. Current approaches typically rely on BERT-based methods or manual expert annotations, which have limitations in terms of scalability and performance.

Objective:

To evaluate the effectiveness of a GPT-based large language model (LLM) in labeling radiology reports, comparing it with two existing methods, CheXbert and CheXpert, on a large chest X-ray dataset (MIMIC-CXR).

Methods:

In this study, we introduce an LLM-based approach fine-tuned on expert-labeled radiology reports. We evaluated the model on 687 radiologist-labeled chest X-ray reports, comparing F1 scores across 14 thoracic pathologies. Its performance was compared with that of the CheXbert and CheXpert models on positive, negative, and uncertainty extraction tasks. Paired t-tests and Wilcoxon signed-rank tests were performed to evaluate the statistical significance of differences between model performances.
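The paired comparison described above can be sketched as follows. This is a minimal illustration using SciPy, with placeholder per-pathology F1 values for 14 pathologies; the numbers are invented for demonstration and are not the paper's data.

```python
# Hypothetical sketch of the Methods' significance testing: compare two
# labelers' per-pathology F1 scores with a paired t-test and a Wilcoxon
# signed-rank test. F1 values below are illustrative placeholders only.
from scipy.stats import ttest_rel, wilcoxon

# Per-pathology F1 scores for 14 thoracic pathologies (hypothetical values)
f1_llm      = [0.92, 0.88, 0.95, 0.90, 0.87, 0.93, 0.91,
               0.89, 0.94, 0.86, 0.90, 0.92, 0.88, 0.91]
f1_chexpert = [0.89, 0.85, 0.93, 0.88, 0.84, 0.91, 0.88,
               0.86, 0.92, 0.83, 0.87, 0.90, 0.85, 0.88]

# Paired t-test: tests whether the mean of the paired F1 differences is zero
t_stat, t_p = ttest_rel(f1_llm, f1_chexpert)

# Wilcoxon signed-rank test: non-parametric check on the same paired differences
w_stat, w_p = wilcoxon(f1_llm, f1_chexpert)

print(f"paired t-test: t={t_stat:.3f}, P={t_p:.4f}")
print(f"Wilcoxon:      W={w_stat:.1f}, P={w_p:.4f}")
```

Both tests operate on the same 14 paired differences, which is why the paper can report agreement between the parametric and non-parametric results.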

Results:

Our GPT-based model achieved an average F1 score of 0.9014 for all certainty levels and 0.8708 for positive/negative certainty levels, outperforming CheXpert (F1 scores of 0.8864 and 0.8525, respectively) and performing comparably to CheXbert (F1 scores of 0.9047 and 0.8733, respectively). Paired t-tests revealed no statistically significant difference between our model and CheXbert (P = 0.3483), but a significant difference between our model and CheXpert (P = 0.0114). The Wilcoxon test also confirmed these results: no significant difference between our model and CheXbert (P = 0.1353) and a significant difference between our model and CheXpert (P = 0.0052).

Conclusions:

The GPT-based LLM demonstrates performance comparable to CheXbert and outperforms CheXpert in radiology report labeling. These findings suggest that LLMs are a promising alternative to traditional BERT-based architectures for this task, offering enhanced context understanding and eliminating the need for extensive feature engineering. Moreover, their larger context length makes LLM-based models better suited to this task than BERT-based models, whose context length is limited. Clinical Trial: Not applicable.



© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.