Accepted for/Published in: JMIR Cancer

Date Submitted: Feb 11, 2024
Date Accepted: Dec 18, 2024

The final, peer-reviewed published version of this preprint can be found here:

Large Language Model Approach for Zero-Shot Information Extraction and Clustering of Japanese Radiology Reports: Algorithm Development and Validation

Yamagishi Y, Nakamura Y, Hanaoka S, Abe O

JMIR Cancer 2025;11:e57275

DOI: 10.2196/57275

PMID: 39864093

Zero-shot Information Extraction and Clustering of Japanese Radiology Reports: Validating A Large Language Model Approach

  • Yosuke Yamagishi
  • Yuta Nakamura
  • Shouhei Hanaoka
  • Osamu Abe

ABSTRACT

Background:

In medicine, the application of natural language processing (NLP) to tasks such as information extraction and classification has increased significantly. NLP plays a crucial role in structuring free-form radiology reports, facilitating the interpretation of textual content, and enhancing data utility through clustering techniques. Clustering allows similar lesions and disease patterns to be identified across a broad dataset, making it useful for aggregating information and discovering new insights in medical imaging. However, most publicly available medical datasets are in English, and few are available in other languages. This scarcity poses a challenge for developing models for non-English downstream tasks.

Objective:

This study aimed to develop and evaluate an algorithm that uses large language models (LLMs) to extract information from Japanese lung cancer radiology reports and perform a clustering analysis. The effectiveness of this approach was assessed and compared with that of previous supervised methods.

Methods:

This study used the MedTxt-RR dataset, which comprises 135 Japanese radiology reports written by nine radiologists who each interpreted the CT images of 15 lung cancer cases from Radiopaedia. Because the dataset was previously used in the clustering competition of the NTCIR-16 shared task, it was well suited to comparing the clustering ability of our algorithm with that of previous methods. The dataset is divided into eight cases for development and seven for testing. The approach used an LLM to extract information pertinent to lung cancer findings and then transformed the extracted items into numeric features for clustering with the K-means method. Information extraction accuracy was evaluated on all 135 reports, and clustering performance was evaluated on the 63 test reports (seven cases, nine reports each). Extraction accuracy was assessed for tumor size, location, and laterality. Clustering performance was evaluated using Normalized Mutual Information (NMI), Adjusted Mutual Information (AMI), and the Fowlkes-Mallows index (FM) for both the development and test data.
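The feature-encoding and clustering steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the field names (`size_mm`, `lobe`, `side`), the category vocabularies, the size scaling, the zero-imputation of missing sizes, and the farthest-first centroid initialization are all assumptions made for illustration, and a small pure-Python K-means stands in for whatever library the study used.

```python
from typing import Dict, List, Optional, Sequence

# Hypothetical category vocabularies; the abstract does not list the ones used.
LOBES = ["upper", "middle", "lower"]
SIDES = ["right", "left"]


def encode(record: Dict[str, Optional[object]]) -> List[float]:
    """Turn one extracted record into a numeric feature vector."""
    size = record.get("size_mm")
    # Assumption: missing sizes are imputed as 0.0 and sizes are scaled by 1/100.
    vec = [float(size) / 100.0 if size is not None else 0.0]
    # One-hot encode location (lobe) and laterality (side).
    vec += [1.0 if record.get("lobe") == lobe else 0.0 for lobe in LOBES]
    vec += [1.0 if record.get("side") == side else 0.0 for side in SIDES]
    return vec


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[List[float]], k: int, iters: int = 100) -> List[int]:
    """Plain Lloyd's K-means with deterministic farthest-first initialization."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        # Pick the point farthest from its nearest existing centroid.
        far = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(far))
    labels: Optional[List[int]] = None
    for _ in range(iters):
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels if labels is not None else []


# Toy records standing in for LLM-extracted fields (not real data).
records = [
    {"size_mm": 30, "lobe": "upper", "side": "right"},
    {"size_mm": 32, "lobe": "upper", "side": "right"},
    {"size_mm": 15, "lobe": "lower", "side": "left"},
    {"size_mm": None, "lobe": "lower", "side": "left"},
]
features = [encode(r) for r in records]
labels = kmeans(features, k=2)
print(labels)
```

In this toy setup the one-hot location and laterality features dominate the distance, so reports describing the same lobe and side cluster together even when a size is missing. How the study weighted or scaled its features is not stated in the abstract.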

Results:

Tumor size was correctly identified in 99 of 135 reports (73.3%); the remaining 36 reports (26.7%) contained errors, primarily missing or incorrect size information. Accuracy was higher for tumor location and laterality, which were correctly identified in 112 of 135 reports (83.0%); the remaining 23 reports (17.0%) contained errors, mainly empty values or incorrect data. On the test data, the clustering achieved an NMI of 0.6414, an AMI of 0.5598, and an FM of 0.5354. The proposed method outperformed previous methods on all evaluation metrics.
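For readers unfamiliar with these clustering scores, the following Python sketch shows how two of them, FM and NMI, can be computed from true case labels and predicted cluster labels. It is an illustrative sketch, not the evaluation code used in the study: the arithmetic-mean normalization for NMI is an assumption (the abstract does not state which normalization was used), and AMI is omitted because it additionally subtracts the mutual information expected by chance, which takes more code.

```python
import math
from collections import Counter
from typing import Hashable, Sequence, Tuple


def contingency(
    true: Sequence[Hashable], pred: Sequence[Hashable]
) -> Tuple[Counter, Counter, Counter]:
    """Joint and marginal label counts for two labelings of the same items."""
    return Counter(zip(true, pred)), Counter(true), Counter(pred)


def fowlkes_mallows(true: Sequence[Hashable], pred: Sequence[Hashable]) -> float:
    """Geometric mean of pairwise precision and recall."""
    n = len(true)
    joint, rows, cols = contingency(true, pred)
    # Each term counts ordered pairs of items that share a label.
    tk = sum(v * v for v in joint.values()) - n  # same in both labelings
    pk = sum(v * v for v in rows.values()) - n   # same in the true labeling
    qk = sum(v * v for v in cols.values()) - n   # same in the predicted labeling
    return tk / math.sqrt(pk * qk) if pk and qk else 0.0


def entropy(counts: Counter, n: int) -> float:
    """Shannon entropy (natural log) of a label distribution."""
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def nmi(true: Sequence[Hashable], pred: Sequence[Hashable]) -> float:
    """Mutual information divided by the mean of the two entropies (assumed)."""
    n = len(true)
    joint, rows, cols = contingency(true, pred)
    mi = sum(
        (c / n) * math.log(c * n / (rows[t] * cols[p]))
        for (t, p), c in joint.items()
    )
    denom = (entropy(rows, n) + entropy(cols, n)) / 2
    return mi / denom if denom > 0 else 1.0


# Toy labels: a perfect clustering with cluster IDs permuted.
print(fowlkes_mallows([0, 0, 1, 1], [1, 1, 0, 0]))
print(nmi([0, 0, 1, 1], [1, 1, 0, 0]))
```

Both scores depend only on which reports are grouped together, not on the cluster IDs themselves, so a clustering that matches the true cases up to relabeling scores 1 on each.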

Conclusions:

The unsupervised LLM method surpassed existing supervised methods in clustering Japanese radiology reports. These results suggest that LLMs are promising for extracting information from radiology reports and integrating it into disease-specific knowledge.



© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.