Accepted for/Published in: JMIR Formative Research

Date Submitted: Jan 3, 2025
Date Accepted: Jun 20, 2025

The final, peer-reviewed published version of this preprint can be found here:

A Multimodal Large Language Model as an End-to-End Classifier of Thyroid Nodule Malignancy Risk: Usability Study

Sng GGR, Xiang Y, Lim DYZ, Tung JYM, Tan JH, Chng CL

JMIR Form Res 2025;9:e70863

DOI: 10.2196/70863

PMID: 40829145

PMCID: 12364431

A multimodal large language model as an end-to-end classifier of thyroid nodule malignancy risk: Practical or Potential?

  • Gerald Gui Ren Sng
  • Yi Xiang
  • Daniel Yan Zheng Lim
  • Joshua Yi Min Tung
  • Jen Hong Tan
  • Chiaw Ling Chng

ABSTRACT

Background:

Thyroid nodules are a prevalent problem in the general population. To date, commercial applications of artificial intelligence (AI) solutions for nodule risk classification have used traditional machine-learning models. Large Language Models (LLMs), especially those equipped for multimodal tasks combining text and image data, have shown promise in various applications, including medical diagnostics. Importantly, they can potentially offer flexibility for application in different imaging classification tasks.

Objective:

This study investigates the effectiveness of a multimodal vision-language model in the ultrasound-based risk stratification of thyroid nodules using the ACR TI-RADS risk stratification system, exploring the model's accuracy, consistency, and the influence of prompt engineering.
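
For readers unfamiliar with the scoring system, the sketch below illustrates how ACR TI-RADS component descriptors are commonly combined into an overall risk level. It is not taken from the study: the point values are reproduced from the published ACR TI-RADS chart as best we recall and should be checked against the ACR reference, the category names and the `tirads_level` function are hypothetical, the special handling of spongiform nodules is omitted, and treating a total of 1 point as TR1 is an assumption.

```python
# Illustrative only: simplified ACR TI-RADS point table (verify against the ACR chart).
POINTS = {
    "composition": {"cystic": 0, "spongiform": 0, "mixed": 1, "solid": 2},
    "echogenicity": {"anechoic": 0, "hyperechoic": 1, "isoechoic": 1,
                     "hypoechoic": 2, "very_hypoechoic": 3},
    "shape": {"wider_than_tall": 0, "taller_than_wide": 3},
    "margin": {"smooth": 0, "ill_defined": 0, "lobulated_or_irregular": 2,
               "extrathyroidal_extension": 3},
}
# Echogenic foci are additive: a nodule may show more than one type.
FOCI_POINTS = {"none_or_comet_tail": 0, "macrocalcifications": 1,
               "peripheral_calcifications": 2, "punctate": 3}


def tirads_level(composition, echogenicity, shape, margin, foci=()):
    """Return the TI-RADS level (1 to 5) for one nodule's descriptors."""
    total = (
        POINTS["composition"][composition]
        + POINTS["echogenicity"][echogenicity]
        + POINTS["shape"][shape]
        + POINTS["margin"][margin]
        + sum(FOCI_POINTS[f] for f in foci)
    )
    if total <= 1:  # assumption: a 1-point total is grouped with TR1
        return 1
    if total == 2:
        return 2
    if total == 3:
        return 3
    if total <= 6:
        return 4
    return 5


# A solid, hypoechoic, wider-than-tall nodule with smooth margins scores 2 + 2 + 0 + 0.
example_level = tirads_level("solid", "hypoechoic", "wider_than_tall", "smooth")
```

Because the overall level depends on every component, an error in any single descriptor can shift the final classification, which is one reason component-level and overall accuracy are reported separately.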

Methods:

We used Microsoft's open-source LLaVA model and its instruction-tuned model, LLaVA-Med, to assess 192 thyroid nodules from ultrasound cine-clip images with ACR TI-RADS descriptors. We analyzed the model outputs and compared basic with modified prompts, and images with and without radiologist-annotated regions of interest. The analysis measured the accuracy of the LLM outputs against manual assessments and the consistency of the outputs.
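
As a rough illustration of the three quantities reported (valid-response rate, accuracy against manual assessment, and consistency), the following minimal sketch computes them from a list of response records. The record layout, the function names, and in particular the definition of consistency (all valid repeated responses for a nodule agree) are assumptions made for this example; the abstract does not state how the authors defined or computed these measures.

```python
from collections import defaultdict


def valid_rate(responses):
    """Fraction of responses that could be parsed into a label."""
    if not responses:
        return 0.0
    return sum(r["predicted"] is not None for r in responses) / len(responses)


def accuracy(responses):
    """Fraction of valid responses matching the manual reference label."""
    valid = [r for r in responses if r["predicted"] is not None]
    if not valid:
        return 0.0
    return sum(r["predicted"] == r["reference"] for r in valid) / len(valid)


def consistency(responses):
    """Fraction of nodules whose valid repeated responses all agree (assumed definition)."""
    by_nodule = defaultdict(list)
    for r in responses:
        if r["predicted"] is not None:
            by_nodule[r["nodule_id"]].append(r["predicted"])
    if not by_nodule:
        return 0.0
    agreeing = sum(len(set(labels)) == 1 for labels in by_nodule.values())
    return agreeing / len(by_nodule)


# Hypothetical records: "predicted" is None when the model output was invalid.
responses = [
    {"nodule_id": 1, "predicted": "TR4", "reference": "TR4"},
    {"nodule_id": 1, "predicted": "TR4", "reference": "TR4"},
    {"nodule_id": 2, "predicted": "TR3", "reference": "TR5"},
    {"nodule_id": 2, "predicted": None, "reference": "TR5"},
    {"nodule_id": 3, "predicted": "TR2", "reference": "TR2"},
    {"nodule_id": 3, "predicted": "TR5", "reference": "TR2"},
]

print(valid_rate(responses), accuracy(responses), consistency(responses))
```

Note that under this definition a model can be highly consistent while still being inaccurate, so the two measures need to be read together.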

Results:

Of 4,608 responses, 83.3% were deemed valid, and prompt engineering improved the frequency of valid responses. The LLaVA-Med model demonstrated higher accuracy than the base model in classifying individual TI-RADS components, including composition (42.1% vs 20.3%, p<0.001) and echogenicity (57.3% vs 49.9%, p=0.004), but overall TI-RADS classification accuracy remained low for both models (31.9% vs 38.9%, p=0.004). The use of labelled images improved accuracy in classifying nodule margins (58.2% vs 53.0%, p=0.040). Prompt engineering improved the consistency of the overall TI-RADS classification (52.1% vs 26.6%, p<0.001), but its effect on accuracy varied across components.

Conclusions:

The study explores the use of open-source, multimodal LLMs as a resource-efficient method of end-to-end thyroid nodule risk stratification, including commonly employed methods of performance optimization. However, the mixed results highlight the challenges in achieving clinically meaningful performance with these models in their current form. The results suggest that while instruction-tuning and prompt engineering can enhance model output, inherent technical limitations in image interpretation and model stochasticity restrict their clinical utility. Future work should build on these findings to explore efficient prompting techniques that improve accuracy and consistency in clinical applications.


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.