JMIR Preprints #76384: Detecting Laterality Errors in Combined Radiographic Studies: Enhancing Traditional Approach with GPT-4o and Multi-Site Internal Validation


Detecting Laterality Errors in Combined Radiographic Studies: Enhancing Traditional Approach with GPT-4o and Multi-Site Internal Validation

Kung-Hsun Weng;
Yi-Chen Chou;
Yu-Ting Kuo;
Tsyh-Jyi Hsieh;
Chung-Feng Liu

ABSTRACT

Background:

Laterality errors in radiology reports can endanger patient safety. Effective methods for screening for laterality errors in combined radiographic reports, which combine multiple studies into one, remain unexplored.

Objective:

This study had three aims: (1) to define and analyze the previously unstudied combined radiographic report format and its challenges; (2) to introduce a clinically deployable ensemble method (rule-based + GPT-4o), evaluated on large-scale, real-world, imbalanced data; and (3) to demonstrate significant performance gaps between real-world imbalanced and synthetic balanced datasets, highlighting limitations of the benchmarking methodology commonly used in current studies.

Methods:

This retrospective study analyzed de-identified English radiology reports whose orders contained laterality terms. We split the data into TrainVal and Test-1 (both real-world, imbalanced) and Test-2 (synthetic, balanced); Test-1 came from a distinct branch. Experiment 1 compared the baseline, workaround, and GPT-4o-augmented rule-based methods. Experiment 2 compared the rule-based method with the highest recall to fine-tuned RoBERTa, ClinicalBERT, and GPT-4o models.

Results:

As of July 2024, our dataset included 10,000 real-world and 889 synthetic radiology reports. The laterality error rate in real-world reports was 1.2%, and it was significantly higher in combined reports (1.47%) than in non-combined reports (0.57%) (difference = 0.90 percentage points, Z = 3.81, p < .001).

Experiment 1: Recall differed significantly among the three versions of the rule-based method (Q = 6.0, p = .0498, Friedman test). The rule-based+GPT-4o method had the highest recall (average rank = 1), significantly better than the baseline (average rank = 3, p = .04, Nemenyi test). Most (5 out of 6) of the false positives introduced by the GPT-4o information extraction were due to parser limitations hidden by error cancellation.

Experiment 2: On Test-1, rule-based+GPT-4o (precision: 0.696, recall: 0.889, F1 score: 0.780) outperformed GPT-4o (precision: 0.219, recall: 0.889, F1 score: 0.352), ClinicalBERT (precision: 0.047, recall: 0.667, F1 score: 0.088), and RoBERTa (F1 score: 0.000). On Test-2, rule-based+GPT-4o (precision: 0.996, recall: 0.925, F1 score: 0.959) and GPT-4o (precision: 0.979, recall: 0.953, F1 score: 0.966) outperformed ClinicalBERT (precision: 0.984, recall: 0.749, F1 score: 0.851) and RoBERTa (F1 score: 0.013). Both ClinicalBERT and GPT-4o exhibited notable declines in precision on TrainVal and Test-1 compared to Test-2. Both Test-1 data membership (GPT-4o: OR 239.89; 95% CI: 111.05–518.01; p < .001; ClinicalBERT: OR 1924.07; 95% CI: 687.46–5383.99; p < .001) and order count per study (GPT-4o: OR 1.79; 95% CI: 1.38–2.31; p < .001; ClinicalBERT: OR 2.50; 95% CI: 1.64–3.80; p < .001) independently predicted false positive errors in multivariate logistic regression. In subgroup analysis, all models showed reduced precision and F1 score in combined-study subgroups.
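The reported figures can be sanity-checked with a few lines of arithmetic. The sketch below is a reader-side check, not the authors' analysis code: it recomputes each F1 score as the harmonic mean of the rounded precision and recall quoted above, and it recovers the Friedman p-value under the assumption that the usual chi-square approximation with k - 1 = 2 degrees of freedom was used for the three rule-based versions (for 2 degrees of freedom the upper-tail probability has the closed form exp(-Q/2)). The function and variable names are illustrative.

```python
import math


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def chi2_sf_df2(q: float) -> float:
    """Upper-tail probability of a chi-square variable with 2 degrees of freedom."""
    return math.exp(-q / 2)


# (precision, recall, reported F1) copied from the Results paragraph.
REPORTED = {
    ("Test-1", "rule-based+GPT-4o"): (0.696, 0.889, 0.780),
    ("Test-1", "GPT-4o"): (0.219, 0.889, 0.352),
    ("Test-1", "ClinicalBERT"): (0.047, 0.667, 0.088),
    ("Test-2", "rule-based+GPT-4o"): (0.996, 0.925, 0.959),
    ("Test-2", "GPT-4o"): (0.979, 0.953, 0.966),
    ("Test-2", "ClinicalBERT"): (0.984, 0.749, 0.851),
}


def max_f1_discrepancy() -> float:
    """Largest absolute gap between recomputed and reported F1 scores."""
    return max(abs(f1_score(p, r) - f1) for p, r, f1 in REPORTED.values())


if __name__ == "__main__":
    for (dataset, model), (p, r, reported) in REPORTED.items():
        print(f"{dataset} {model}: recomputed {f1_score(p, r):.3f}, reported {reported:.3f}")
    print(f"Friedman p for Q = 6.0 with 2 df: {chi2_sf_df2(6.0):.4f}")
```

Because the inputs are already rounded to three decimals, a recomputed F1 score can differ from the reported one in the last digit; agreement within 0.001 is what this check can establish.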

Conclusions:

The combined radiographic report format poses distinct challenges for both radiology report quality assurance and natural language processing. The combined rule-based and GPT-4o method effectively screens for laterality errors in imbalanced real-world reports. A significant performance gap exists between balanced synthetic datasets and imbalanced real-world data. Future studies should also include real-world imbalanced data.
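One reason such a gap is expected follows directly from Bayes' rule: for a fixed sensitivity and specificity, precision falls sharply as the error prevalence drops. The sketch below illustrates this with the 1.2% real-world error rate reported in the abstract. The sensitivity and specificity values are hypothetical and chosen only for illustration, and "balanced" is assumed to mean a 50% error prevalence; none of these three numbers comes from the study.

```python
def precision_at_prevalence(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value of a classifier at a given class prevalence."""
    true_positive_rate = sensitivity * prevalence
    false_positive_rate = (1 - specificity) * (1 - prevalence)
    return true_positive_rate / (true_positive_rate + false_positive_rate)


SENSITIVITY = 0.90             # hypothetical, not from the study
SPECIFICITY = 0.98             # hypothetical, not from the study
REAL_WORLD_PREVALENCE = 0.012  # 1.2% laterality error rate reported in the abstract
BALANCED_PREVALENCE = 0.5      # assumption: "balanced" taken as 50% errors

if __name__ == "__main__":
    for label, prevalence in [
        ("balanced", BALANCED_PREVALENCE),
        ("real-world", REAL_WORLD_PREVALENCE),
    ]:
        ppv = precision_at_prevalence(SENSITIVITY, SPECIFICITY, prevalence)
        print(f"{label}: prevalence = {prevalence:.3f}, precision = {ppv:.3f}")
```

The same hypothetical classifier that looks highly precise on a balanced set produces mostly false alarms at a 1.2% prevalence, which is why benchmark results on balanced synthetic data can overstate real-world precision.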

Citation

Please cite as:

Weng KH, Chou YC, Kuo YT, Hsieh TJ, Liu CF

Detecting Laterality Errors in Combined Radiographic Studies by Enhancing the Traditional Approach With GPT-4o: Algorithm Development and Multisite Internal Validation

JMIR Form Res 2025;9:e76384

DOI: 10.2196/76384

PMID: 41161340

PMCID: 12612642


Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Accepted for/Published in: JMIR Formative Research

Date Submitted: Apr 22, 2025

Date Accepted: Oct 2, 2025
