JMIR Preprints #71137: Development of a Large-scale Dataset of Chest Computed Tomography Reports in Japanese and a High-performance Finding Classification Model: An Annotated Dataset and a Deep Neural Network Classifier


Development of a Large-scale Dataset of Chest Computed Tomography Reports in Japanese and a High-performance Finding Classification Model: An Annotated Dataset and a Deep Neural Network Classifier

Yosuke Yamagishi;
Yuta Nakamura;
Tomohiro Kikuchi;
Yuki Sonoda;
Hiroshi Hirakawa;
Shintaro Kano;
Satoshi Nakamura;
Shouhei Hanaoka;
Takeharu Yoshikawa;
Osamu Abe

ABSTRACT

Background:

Recent advances in large language models have highlighted the need for high-quality multilingual medical datasets. Although Japan is a global leader in computed tomography (CT) scanner deployment and utilization, the absence of large-scale Japanese radiology datasets has hindered the development of specialized language models for medical imaging analysis. Despite the emergence of multilingual models and language-specific adaptations, the development of Japanese-specific medical language models has been constrained by a lack of comprehensive datasets, particularly in radiology.

Objective:

To address this gap in Japanese medical natural language processing resources, we developed a comprehensive Japanese CT report dataset through machine translation and used it to establish a specialized language model for structured classification of findings. Additionally, we created a rigorously validated evaluation dataset through expert radiologist refinement to ensure a reliable assessment of model performance.

Methods:

We translated the CT-RATE dataset (24,283 CT reports from 21,304 patients) into Japanese using GPT-4o mini. The training dataset consisted of 22,778 machine-translated reports, and the validation dataset included 150 reports carefully revised by radiologists. We developed CT-BERT-JPN (Japanese), a specialized BERT model based on the "tohoku-nlp/bert-base-japanese-v3" architecture, to extract 18 structured findings from Japanese radiology reports. The translated reports were assessed using Bilingual Evaluation Understudy (BLEU) and Recall-Oriented Understudy for Gisting Evaluation (ROUGE) scores, complemented by an expert radiologist’s review. Model performance was evaluated using standard metrics, including accuracy, precision, recall, F1 score, and area under the receiver operating characteristic curve, with GPT-4o serving as the baseline.
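The per-finding evaluation described in the Methods can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function name `per_label_metrics`, the two example finding names, and the toy label matrices are assumptions chosen for illustration, and the area under the receiver operating characteristic curve is omitted because it requires predicted probabilities rather than binary labels.

```python
from typing import Dict, List


def per_label_metrics(
    y_true: List[List[int]],
    y_pred: List[List[int]],
    labels: List[str],
) -> Dict[str, Dict[str, float]]:
    """Compute accuracy, precision, recall, and F1 separately for each finding.

    y_true and y_pred are report-by-finding matrices of 0/1 values
    (hypothetical layout: one row per report, one column per finding).
    """
    results: Dict[str, Dict[str, float]] = {}
    for j, name in enumerate(labels):
        tp = fp = fn = tn = 0
        for true_row, pred_row in zip(y_true, y_pred):
            t, p = true_row[j], pred_row[j]
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
            else:
                tn += 1
        total = tp + fp + fn + tn
        # Guard against division by zero when a finding has no positives.
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (
            2 * precision * recall / (precision + recall)
            if (precision + recall)
            else 0.0
        )
        accuracy = (tp + tn) / total if total else 0.0
        results[name] = {
            "accuracy": accuracy,
            "precision": precision,
            "recall": recall,
            "f1": f1,
        }
    return results


if __name__ == "__main__":
    # Toy data, invented for illustration only: 4 reports, 2 findings.
    findings = ["atelectasis", "lymphadenopathy"]
    reference = [[1, 0], [1, 1], [0, 1], [0, 0]]
    predicted = [[1, 0], [1, 0], [0, 1], [1, 0]]
    for finding, scores in per_label_metrics(reference, predicted, findings).items():
        print(finding, {k: round(v, 3) for k, v in scores.items()})
```

Scoring each finding independently, rather than pooling all labels, matches the way the abstract reports results per condition and keeps rare findings from being masked by common ones.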

Results:

General text structure was preserved, as indicated by BLEU scores of 0.731 and 0.690 and ROUGE scores ranging from 0.770 to 0.876 for findings and 0.748 to 0.857 for impression. Expert review suggested refinements in medical terminology. These modifications fell into three categories: contextual refinement of technical terms, completion of incomplete translations, and Japanese localization of medical terminology. This highlights the importance of expert validation in medical translations. CT-BERT-JPN demonstrated superior performance compared with GPT-4o in 11 of the 18 conditions, including lymphadenopathy (+14.2%), interlobular septal thickening (+10.9%), and atelectasis (+7.4%). The model achieved perfect scores in four conditions (cardiomegaly, hiatal hernia, atelectasis, and interlobular septal thickening), and the F1 score exceeded 0.95 in 14 of the 18 conditions. Performance remained robust despite the varying number of positive samples across conditions (ranging from 7 to 82 cases).

Conclusions:

Our study established a robust Japanese CT report dataset and demonstrated the effectiveness of a specialized language model in the structured classification of findings. This hybrid approach of machine translation and expert validation enabled the creation of large-scale datasets while maintaining high-quality standards. This study provides essential resources for advancing medical AI research in Japanese healthcare settings, with the datasets and models made publicly available for research to facilitate further advancement in the field.

Citation

Please cite as:

Yamagishi Y, Nakamura Y, Kikuchi T, Sonoda Y, Hirakawa H, Kano S, Nakamura S, Hanaoka S, Yoshikawa T, Abe O

Development of a Large-Scale Dataset of Chest Computed Tomography Reports in Japanese and a High-Performance Finding Classification Model: Dataset Development and Validation Study

JMIR Med Inform 2025;13:e71137

DOI: 10.2196/71137

PMID: 40874833

PMCID: 12392688


Copyright

© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Jan 10, 2025

Date Accepted: Jul 15, 2025
