Currently accepted at: Journal of Medical Internet Research
Date Submitted: Oct 23, 2025
Date Accepted: Feb 28, 2026
Date Submitted to PubMed: Mar 1, 2026
This paper has been accepted and is currently in production.
It will appear shortly under DOI 10.2196/86365.
The final accepted version (not yet copyedited) appears below.
An ahead-of-print version has been submitted to PubMed (PMID: 41764068).
Context-Aware Sentence Classification of Radiology Reports Using Synthetic Data: Development and Validation Study
ABSTRACT
Background:
Vision-language models (VLMs) for radiology require large-scale image–text pairs. However, free-text reports mix background information, findings, and continuation sentences. Manual annotation is labor-intensive, and the direct use of clinical reports raises privacy concerns.
Objective:
We aimed to develop a context-aware sentence classification model for Japanese radiology reports using synthetic and automatically annotated data and validate it using multi-institutional clinical reports.
Methods:
Synthetic Japanese radiology reports were generated using the OpenAI API (GPT-4.1); sentence-level annotations were performed using GPT-4.1-mini in four categories: context, positive findings, negative findings, and continuation. After filtering, 3,104 reports were divided into training (2,670), validation (334), and testing (100) sets. For external validation, 280 reports dated October 1, 2024, were sampled from seven institutions in the Japan Medical Image Database and annotated by two radiologists. Large language models (Qwen3 and LLaMA 3.2) and Japanese text classification models (BERT base Japanese, ModernBERT-Ja-130M, and JMedRoBERTa) were fine-tuned and evaluated for accuracy, macro-F1, and positive predictive value for label 1 (PPV_1).
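To make the classification setup concrete, the following is a minimal sketch of inference with a fine-tuned four-category sentence classifier, assuming Hugging Face transformers, a BERT base Japanese checkpoint (the checkpoint ID below is an assumption), and sentence-pair encoding of the preceding sentence as context. The abstract does not specify the exact context construction, label order, or hyperparameters, so all of those are illustrative.

```python
# Illustrative sketch only: context construction, label order, and the
# checkpoint ID are assumptions, not the paper's exact implementation.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["context", "positive_finding", "negative_finding", "continuation"]
MODEL_NAME = "tohoku-nlp/bert-base-japanese-v3"  # assumed checkpoint ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=len(LABELS)
)

def classify(prev_sentence: str, sentence: str) -> str:
    """Classify one report sentence, passing the preceding sentence as
    context via BERT's sentence-pair input (one plausible way to make the
    classifier context-aware; the paper may construct context differently)."""
    inputs = tokenizer(prev_sentence, sentence, truncation=True,
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return LABELS[int(logits.argmax(dim=-1))]
```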
Results:
For the internal test set (1,124 sentences), all models performed well: accuracy, 0.939–0.950; macro-F1, 0.924–0.940; and PPV_1, 0.904–0.953. For the external dataset (3,477 sentences), accuracy declined to 0.783–0.812 and macro-F1 to 0.761–0.790. Qwen3-4B showed the best performance (PPV_1 = 0.952).
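As a hypothetical illustration of how the reported metrics can be computed, the sketch below uses scikit-learn and assumes integer class labels 0–3 with label 1 denoting positive findings; the abstract does not fix the label encoding, so that mapping and the toy data are assumptions.

```python
# Toy illustration of the evaluation metrics (assumed label encoding).
from sklearn.metrics import accuracy_score, f1_score, precision_score

y_true = [1, 0, 2, 1, 3, 1]   # toy reference annotations
y_pred = [1, 0, 2, 2, 3, 1]   # toy model predictions

accuracy = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")
# PPV_1 is the positive predictive value (precision) for label 1.
ppv_1 = precision_score(y_true, y_pred, labels=[1], average=None)[0]
print(f"accuracy={accuracy:.3f}, macro-F1={macro_f1:.3f}, PPV_1={ppv_1:.3f}")
```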
Conclusions:
The model trained solely on synthetic reports showed robust performance on real-world Japanese radiology reports. This approach enables the efficient extraction of finding-level sentences and supports the large-scale construction of image–text pairs for Japanese VLM development.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.