Accepted for/Published in: JMIR Formative Research

Date Submitted: Apr 22, 2025
Date Accepted: Oct 2, 2025

The final, peer-reviewed published version of this preprint can be found here:

Detecting Laterality Errors in Combined Radiographic Studies by Enhancing the Traditional Approach With GPT-4o: Algorithm Development and Multisite Internal Validation

Weng KH, Chou YC, Kuo YT, Hsieh TJ, Liu CF

Detecting Laterality Errors in Combined Radiographic Studies by Enhancing the Traditional Approach With GPT-4o: Algorithm Development and Multisite Internal Validation

JMIR Form Res 2025;9:e76384

DOI: 10.2196/76384

PMID: 41161340

PMCID: 12612642

Detecting Laterality Errors in Combined Radiographic Studies: Enhancing Traditional Approach with GPT-4o and Multi-Site Internal Validation

  • Kung-Hsun Weng
  • Yi-Chen Chou
  • Yu-Ting Kuo
  • Tsyh-Jyi Hsieh
  • Chung-Feng Liu

ABSTRACT

Background:

Laterality errors in radiology reports can endanger patient safety. Effective methods for screening for laterality errors in combined radiographic reports, which combine multiple studies into one, remain unexplored.

Objective:

First, we define and analyze the unstudied combined radiographic report format and its challenges. Second, we introduce a clinically deployable ensemble method (rule-based + GPT-4o), evaluated on large-scale, real-world, imbalanced data. Third, we demonstrate significant performance gaps between real-world imbalanced and synthetic balanced datasets, highlighting limitations of the benchmarking methodology commonly used in current studies.

Methods:

This retrospective study analyzed de-identified English radiology reports whose orders contained laterality terms. We split the data into TrainVal and Test-1 (both real-world, imbalanced) and Test-2 (synthetic, balanced). Test-1 comes from a distinct branch. Experiment 1 compared the baseline, workaround, and GPT-4o-augmented rule-based methods. Experiment 2 compared the rule-based method with the highest recall against fine-tuned RoBERTa, ClinicalBERT, and GPT-4o models.
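As a rough illustration of what a baseline rule-based screen might look like, the following Python sketch compares the sides named in an order with the sides named in its report. It is not the authors' algorithm: the term list, the `sides` and `flag_laterality_mismatch` helpers, and the example strings are all hypothetical, and the abstract does not describe the actual rules or the GPT-4o extraction step.

```python
import re

# Hypothetical term list; the abstract does not specify the real rules.
LATERALITY_TERMS = {"left": "left", "right": "right", "bilateral": "both"}
_PATTERN = re.compile(
    r"\b(" + "|".join(LATERALITY_TERMS) + r")\b", re.IGNORECASE
)


def sides(text):
    """Return the set of sides ("left", "right") mentioned in text."""
    found = {LATERALITY_TERMS[m.group(1).lower()] for m in _PATTERN.finditer(text)}
    if "both" in found:
        # Treat "bilateral" as mentioning both sides.
        return {"left", "right"}
    return found


def flag_laterality_mismatch(order_text, report_text):
    """Flag a report that names a side the order does not name.

    Returns False when the order names no side, because this naive
    check then has nothing to compare against.
    """
    order_sides = sides(order_text)
    report_sides = sides(report_text)
    return bool(order_sides) and bool(report_sides - order_sides)


if __name__ == "__main__":
    # Made-up examples, not taken from the study data.
    print(flag_laterality_mismatch("Right knee AP and lateral",
                                   "Left knee: no fracture."))
    print(flag_laterality_mismatch("Right wrist; left knee",
                                   "Left wrist normal. Right knee effusion."))
```

Under these assumptions, a combined order that names both sides (for example, a right wrist and a left knee) never triggers this naive check, even when the report swaps the sides. That hints at why combined reports may call for finer-grained, per-study extraction.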

Results:

As of July 2024, our dataset included 10,000 real-world and 889 synthetic radiology reports. The laterality error rate in real-world reports was 1.2% and was significantly higher in combined reports (1.47%) than in non-combined reports (0.57%) (difference = 0.90 percentage points, Z = 3.81, p < .001).

Experiment 1: Recall differed significantly among the three versions of the rule-based method (Q = 6.0, p = .0498, Friedman test). The rule-based+GPT-4o method had the highest recall (average rank = 1), significantly better than the baseline (average rank = 3, p = .04, Nemenyi test). Most (5 of 6) of the false positives introduced by the GPT-4o information extraction were due to parser limitations hidden by error cancellation.

Experiment 2: On Test-1, rule-based+GPT-4o (precision: 0.696, recall: 0.889, F1 score: 0.780) outperformed GPT-4o (precision: 0.219, recall: 0.889, F1 score: 0.352), ClinicalBERT (precision: 0.047, recall: 0.667, F1 score: 0.088), and RoBERTa (F1 score: 0.000). On Test-2, rule-based+GPT-4o (precision: 0.996, recall: 0.925, F1 score: 0.959) and GPT-4o (precision: 0.979, recall: 0.953, F1 score: 0.966) outperformed ClinicalBERT (precision: 0.984, recall: 0.749, F1 score: 0.851) and RoBERTa (F1 score: 0.013). Both ClinicalBERT and GPT-4o exhibited notable declines in precision on TrainVal and Test-1 compared to Test-2. Both Test-1 data membership (GPT-4o: OR 239.89; 95% CI: 111.05–518.01; p < .001; ClinicalBERT: OR 1924.07; 95% CI: 687.46–5383.99; p < .001) and order count per study (GPT-4o: OR 1.79; 95% CI: 1.38–2.31; p < .001; ClinicalBERT: OR 2.50; 95% CI: 1.64–3.80; p < .001) independently predicted false positive errors in multivariate logistic regression. In subgroup analysis, all models showed reduced precision and F1 score in combined-study subgroups.
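The F1 scores above follow from the reported precision and recall through the harmonic mean, F1 = 2PR / (P + R). The short Python sketch below recomputes them from the rounded values in this abstract; the `f1_score` helper and the `REPORTED` table are illustrative names, and RoBERTa is left out because its precision and recall are not reported here.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1), copied from the abstract.
REPORTED = {
    ("Test-1", "rule-based+GPT-4o"): (0.696, 0.889, 0.780),
    ("Test-1", "GPT-4o"): (0.219, 0.889, 0.352),
    ("Test-1", "ClinicalBERT"): (0.047, 0.667, 0.088),
    ("Test-2", "rule-based+GPT-4o"): (0.996, 0.925, 0.959),
    ("Test-2", "GPT-4o"): (0.979, 0.953, 0.966),
    ("Test-2", "ClinicalBERT"): (0.984, 0.749, 0.851),
}

if __name__ == "__main__":
    for (dataset, model), (p, r, reported_f1) in REPORTED.items():
        recomputed = f1_score(p, r)
        print(f"{dataset} {model}: recomputed {recomputed:.3f}, "
              f"reported {reported_f1:.3f}")
```

Because the inputs are already rounded to three decimals, a recomputed value can differ from the reported one in the last digit.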

Conclusions:

The combined radiographic report format poses distinct challenges for both radiology report quality assurance and natural language processing. The combined rule-based and GPT-4o method effectively screens for laterality errors in imbalanced real-world reports. A significant performance gap exists between balanced synthetic datasets and imbalanced real-world data. Future studies should also include real-world imbalanced data.
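One general reason precision can fall on imbalanced data is that, for a fixed sensitivity and specificity, precision depends on error prevalence. The sketch below illustrates that relationship only; it does not reproduce the study's analysis. The 1.2% prevalence comes from the abstract, while the 50% "balanced" prevalence, the sensitivity, and the specificity are hypothetical values chosen for illustration.

```python
def precision_at_prevalence(sensitivity, specificity, prevalence):
    """Expected precision (positive predictive value) for a classifier
    with the given sensitivity and specificity at a given prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


if __name__ == "__main__":
    # Hypothetical operating point, not taken from the study.
    sensitivity, specificity = 0.90, 0.98

    balanced = precision_at_prevalence(sensitivity, specificity, 0.50)
    real_world = precision_at_prevalence(sensitivity, specificity, 0.012)

    print(f"precision at 50% prevalence:  {balanced:.3f}")
    print(f"precision at 1.2% prevalence: {real_world:.3f}")
```

With these assumed values, the same classifier yields far lower precision at 1.2% prevalence than at 50%, which is consistent in direction with the gap reported between the synthetic and real-world test sets.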


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.