Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Aug 19, 2024
Open Peer Review Period: Aug 19, 2024 - Oct 14, 2024
Date Accepted: Mar 13, 2025
Large Language Models in Summarizing Radiology Report Impressions for Lung Cancer in Chinese: Evaluation Study
ABSTRACT
Background:
Large language models (LLMs), such as ChatGPT, have demonstrated impressive capabilities in various natural language processing tasks, particularly in text generation. However, their effectiveness in summarizing radiology report impressions remains uncertain.
Objective:
This study aims to evaluate the capability of nine LLMs, namely Tongyi Qianwen, ERNIE Bot, ChatGPT, Bard, Claude, Baichuan, ChatGLM, HuatuoGPT, and ChatGLM-Med, in summarizing radiology report impressions.
Methods:
We collected three types of radiology reports, namely CT, PET-CT, and ultrasound (US) reports, from Peking University Cancer Hospital and Institute. Using these reports, we constructed zero-shot, one-shot, and three-shot prompts, with or without complete example reports, as inputs for generating impressions. We used both automatic quantitative evaluation metrics and five human evaluation metrics (completeness, correctness, conciseness, verisimilitude, and replaceability) to assess the generated impressions. Two thoracic surgeons (ZSY and LB) and one radiologist (LQ) compared the generated impressions with the reference impressions and scored them on the five human evaluation metrics.
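To make the prompting setup concrete, below is a minimal Python sketch of how zero-, one-, and three-shot prompts could be assembled from example report-impression pairs, together with a character-level overlap F1 as a simple stand-in for the automatic metrics, which the abstract does not name. The prompt wording, the `build_prompt` helper, and the metric choice are illustrative assumptions, not the authors' exact protocol; the study itself used Chinese-language reports.

```python
# Minimal sketch (assumed, not the authors' exact protocol) of building
# zero-/one-/three-shot prompts from report-impression example pairs, plus
# a character-level F1 as a stand-in for the unnamed automatic metrics.
from collections import Counter
from typing import List, Tuple

Example = Tuple[str, str]  # (findings, reference impression)

def build_prompt(findings: str, examples: List[Example]) -> str:
    """Assemble an n-shot prompt; n = len(examples) (0, 1, or 3 here)."""
    parts = ["Summarize the findings of the radiology report into an impression."]
    for ex_findings, ex_impression in examples:            # in-context examples
        parts.append(f"Findings: {ex_findings}\nImpression: {ex_impression}")
    parts.append(f"Findings: {findings}\nImpression:")     # query to complete
    return "\n\n".join(parts)

def char_f1(prediction: str, reference: str) -> float:
    """Character-overlap F1 (characters are a reasonable unit for Chinese)."""
    pred, ref = Counter(prediction), Counter(reference)
    overlap = sum((pred & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

# Usage with toy data: zero-shot vs. three-shot prompts for one CT report.
shots = [("example findings 1", "example impression 1"),
         ("example findings 2", "example impression 2"),
         ("example findings 3", "example impression 3")]
print(build_prompt("Right upper lobe nodule, 8 mm...", []))         # zero-shot
print(build_prompt("Right upper lobe nodule, 8 mm...", shots[:3]))  # three-shot
print(char_f1("Right upper lobe nodule, follow-up advised.",
              "8 mm nodule in right upper lobe; recommend follow-up CT."))
```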
Results:
In the automatic quantitative evaluation, ERNIE Bot, Tongyi Qianwen, and ChatGPT demonstrated the best overall performance in generating impressions for CT, PET-CT, and US reports, respectively. In the human evaluation, ERNIE Bot outperformed the other LLMs in conciseness, verisimilitude, and replaceability on CT impression generation, while its completeness and correctness scores were comparable to those of the other LLMs. Tongyi Qianwen excelled in PET-CT impression generation, with the highest scores for correctness, conciseness, verisimilitude, and replaceability. Claude achieved the best conciseness, verisimilitude, and replaceability scores on US impression generation, and its completeness and correctness scores were close to the best results obtained by the other LLMs. Overall, the generated impressions were generally complete and correct but lacked conciseness and verisimilitude. Although one-shot and three-shot prompts improved conciseness and verisimilitude, clinicians noted a significant gap between the generated impressions and those written by radiologists.
Conclusions:
Current large language models can produce radiology impressions with high completeness and correctness but fall short in conciseness and verisimilitude, indicating they cannot yet fully replace impressions written by radiologists.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.