Accepted for/Published in: JMIR Formative Research
Date Submitted: Jan 3, 2025
Date Accepted: Jun 20, 2025
A multimodal large language model as an end-to-end classifier of thyroid nodule malignancy risk: Practical or Potential?
ABSTRACT
Background:
Thyroid nodules are a prevalent problem in the general population. To date, commercial applications of artificial intelligence (AI) solutions for nodule risk classification have used traditional machine-learning models. Large Language Models (LLMs), especially those equipped for multimodal tasks combining text and image data, have shown promise in various applications, including medical diagnostics. Importantly, they can potentially offer flexibility for application in different imaging classification tasks.
Objective:
This study investigates the effectiveness of a multimodal vision-language model in the ultrasound-based risk stratification of thyroid nodules using the ACR TI-RADS risk stratification system, exploring the model's accuracy, consistency, and the influence of prompt engineering.
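For readers unfamiliar with the scoring scheme, the sketch below shows how ACR TI-RADS descriptors combine into a risk level. The point values follow the published ACR TI-RADS chart. The function and dictionary names are illustrative assumptions, not part of the study's pipeline, and grouping a 1-point nodule with TR1 is a common convention rather than something stated in the abstract.

```python
# Illustrative sketch of ACR TI-RADS point scoring (names are hypothetical).
POINTS = {
    "composition": {"cystic": 0, "spongiform": 0, "mixed": 1, "solid": 2},
    "echogenicity": {"anechoic": 0, "hyperechoic": 1, "isoechoic": 1,
                     "hypoechoic": 2, "very hypoechoic": 3},
    "shape": {"wider-than-tall": 0, "taller-than-wide": 3},
    "margin": {"smooth": 0, "ill-defined": 0, "lobulated": 2,
               "irregular": 2, "extrathyroidal extension": 3},
}
FOCI_POINTS = {"none": 0, "large comet-tail": 0, "macrocalcifications": 1,
               "peripheral": 2, "punctate": 3}


def tirads_points(composition, echogenicity, shape, margin, foci=()):
    """Sum the points for one nodule; echogenic foci are additive."""
    if composition == "spongiform":
        # ACR chart: spongiform nodules receive 0 points with no further additions.
        return 0
    total = (POINTS["composition"][composition]
             + POINTS["echogenicity"][echogenicity]
             + POINTS["shape"][shape]
             + POINTS["margin"][margin])
    total += sum(FOCI_POINTS[f] for f in foci)
    return total


def tirads_level(points):
    """Map total points to a TI-RADS level (TR1 to TR5)."""
    if points <= 1:   # assumption: a 1-point nodule is grouped with TR1
        return 1
    if points == 2:
        return 2
    if points == 3:
        return 3
    if points <= 6:
        return 4
    return 5
```

An end-to-end classifier has to get every component right for the final level to be right, which is one reason overall accuracy can lag behind per-component accuracy.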
Methods:
We used Microsoft's open-source LLaVA model and its instruction-tuned variant, LLaVA-Med, to assess 192 thyroid nodules from ultrasound cine-clip images with ACR TI-RADS descriptors. We analyzed the model outputs under basic and modified prompts, and with images with and without radiologist-annotated regions of interest. The analysis measured the accuracy of the LLM outputs against manual assessments, as well as the consistency of the outputs.
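The three quantities named above (valid-response rate, accuracy against manual assessment, and consistency) could be computed along the following lines. This is a minimal sketch under stated assumptions: the record layout, the `summarize` function, and the definition of consistency as "all repeated valid outputs for the same nodule and condition agree" are hypothetical and are not taken from the study.

```python
from collections import defaultdict


def summarize(responses):
    """Summarize model outputs.

    Each response is a dict with hypothetical keys:
      nodule_id, condition (e.g. prompt type), predicted (None if the
      output could not be parsed), and reference (the manual assessment).
    """
    valid = [r for r in responses if r["predicted"] is not None]
    validity = len(valid) / len(responses) if responses else float("nan")

    # Accuracy is computed over valid responses only (an assumption).
    if valid:
        correct = sum(r["predicted"] == r["reference"] for r in valid)
        accuracy = correct / len(valid)
    else:
        accuracy = float("nan")

    # Group repeated outputs for the same nodule under the same condition.
    groups = defaultdict(list)
    for r in valid:
        groups[(r["nodule_id"], r["condition"])].append(r["predicted"])

    # A group counts as consistent when all of its repeated outputs agree.
    repeated = [g for g in groups.values() if len(g) > 1]
    if repeated:
        consistency = sum(len(set(g)) == 1 for g in repeated) / len(repeated)
    else:
        consistency = float("nan")

    return {"validity": validity, "accuracy": accuracy, "consistency": consistency}
```

Because model outputs are stochastic, measuring consistency requires querying the same image and prompt more than once; groups with a single valid output are excluded from the consistency estimate in this sketch.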
Results:
Of 4,608 responses, 83.3% were deemed valid, with prompt engineering improving the frequency of valid responses. The LLaVA-Med model demonstrated higher accuracy than the base model in classifying individual TI-RADS components, including composition (42.1% vs 20.3%, p<0.001) and echogenicity (57.3% vs 49.9%, p=0.004), but overall TI-RADS classification accuracy remained low for both models (31.9% vs 38.9%, p=0.004). The use of labelled images improved accuracy in classifying nodule margins (58.2% vs 53.0%, p=0.040). Prompt engineering improved the consistency of the overall TI-RADS classification (52.1% vs 26.6%, p<0.001), but its effect on accuracy varied across components.
Conclusions:
The study explores the use of open-source, multimodal LLMs as a resource-efficient method of end-to-end thyroid nodule risk stratification, including commonly employed methods of performance optimization. However, the mixed results highlight the challenges of achieving clinically meaningful performance with these models in their current form. The results suggest that while instruction-tuning and prompt engineering can enhance model output, inherent technical limitations in image interpretation and model stochasticity restrict clinical utility. Future work should build on these findings to explore efficient prompting techniques that improve accuracy and consistency in clinical applications.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.