
Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Nov 5, 2025
Open Peer Review Period: Nov 5, 2025 - Dec 31, 2025
Date Accepted: Jan 7, 2026

The final, peer-reviewed published version of this preprint can be found here:

Multimodal Large Language Models for Cystoscopic Image Interpretation and Bladder Lesion Classification: Comparative Study

Shih YC, Wu CY, Huang SW, Tsai CY


J Med Internet Res 2026;28:e87193

DOI: 10.2196/87193

PMID: 41605505

PMCID: 12895159

In-depth Analysis of Multi-Modal Large Language Models in Cystoscopic Image Interpretation and Lesion Classification: Comparative Study

  • Yung-Chi Shih; 
  • Cheng-Yang Wu; 
  • Shi-Wei Huang; 
  • Chung-You Tsai

ABSTRACT

Background:

Cystoscopy remains the gold standard for diagnosing bladder lesions; however, its diagnostic accuracy is operator-dependent, and it is prone to missing subtle abnormalities such as carcinoma in situ and to misinterpreting mimic lesions (tumors, inflammation, or normal variants). AI-based image-analysis systems are emerging to enhance lesion detection and diagnostic precision, yet conventional models remain limited to single tasks (e.g., lesion classification or segmentation) and cannot produce explanatory reports or articulate diagnostic reasoning. Multimodal large language models (MM-LLMs) integrate visual recognition, contextual reasoning, and language generation, offering interpretive capabilities beyond conventional AI.

Objective:

To rigorously evaluate state-of-the-art MM-LLMs for cystoscopic image interpretation and lesion classification using clinician-defined stress-test datasets enriched with rare, diverse and challenging lesions, focusing on diagnostic accuracy, reasoning quality, and clinical relevance.

Methods:

Four MM-LLMs (OpenAI-o3, ChatGPT-4o, Gemini-2.5-Pro, MedGemma-27B) were evaluated under blinded, randomized procedures across two tasks: (1) free-text image interpretation covering anatomic site, findings, lesion reasoning, and final diagnosis (n=401); and (2) seven-class tumor-like lesion classification (n=113) within a multiple-choice framework (cystitis, polyps, papilloma, papillary urothelial carcinoma, carcinoma in situ, non-urothelial carcinoma, none-of-the-above). Three raters independently scored outputs using a 5-point Likert scale, and classification metrics (accuracy, sensitivity, specificity, Youden-J, Matthews correlation coefficient [MCC]) were calculated for lesion detection, biopsy indication, and malignancy endpoints. For optimization, model performance was compared between zero-shot prompts and text-based in-context learning (ICL) prompts that prefixed brief tumor-feature descriptions.
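For readers unfamiliar with how these binary-endpoint statistics relate, the following is a minimal sketch of computing them from confusion-matrix counts. The function name `binary_metrics` and the example counts are illustrative only and are not taken from the study data:

```python
import math

def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Standard binary-classification statistics from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)                  # true-positive rate (recall)
    specificity = tn / (tn + fp)                  # true-negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    youden_j = sensitivity + specificity - 1      # Youden's J statistic
    # Matthews correlation coefficient: stays informative under class imbalance
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "specificity": specificity, "youden_j": youden_j, "mcc": mcc}

# Illustrative counts only (not study data):
metrics = binary_metrics(tp=9, fp=2, tn=8, fn=1)
print({k: round(v, 3) for k, v in metrics.items()})
```

Youden-J summarizes the sensitivity-specificity trade-off in a single number, which is why it is reported alongside MCC for the clinically relevant endpoints.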

Results:

The 401-image test set spanned 40 subcategories, with 322 (80.3%) containing abnormal findings. In the image-interpretation task, inter-rater reliability was excellent (14 of 16 intraclass-correlation coefficients = 0.82–0.94). OpenAI-o3 demonstrated strong reasoning, with high satisfaction for anatomy (84.5%) and findings (76.0%) but lower satisfaction for lesion reasoning (52.5%) and final diagnosis (48.2%), indicating increasing difficulty with higher-order synthesis. Mean Likert-score differences (o3 minus Gemini-2.5-Pro) were +0.27 for findings, +0.24 for lesion reasoning, and +0.19 for final diagnosis (all p < 0.05). For clinically relevant endpoints in the full set, o3 achieved the most balanced performance (lesion-detection accuracy 88.3%, sensitivity 92.0%, specificity 73.1%; Youden-J 0.650; MCC 0.635). In seven-class tumor-like lesion classification, o3 achieved accuracies of 72.3% for biopsy indication and 63.4% for malignancy, with a balanced sensitivity–specificity trade-off, outperforming the other models. Notably, o3 performed best on prevalent malignant lesions but struggled with rare entities. ChatGPT-4o and Gemini-2.5-Pro showed high sensitivity but low specificity, whereas MedGemma-27B underperformed. ICL improved o3's micro-average accuracy (41.1%→46.0%; MCC 0.313→0.370) but yielded only slightly increased specificity without overall accuracy gains for the other models, likely constrained by the absence of paired image-text context.

Conclusions:

MM-LLMs demonstrate meaningful assistive potential in generating interpretable cystoscopy free-text rationales, supporting biopsy triage and training. However, performance in difficult differential diagnoses remains modest and requires further optimization before safe clinical integration. Clinical Trial: N/A




© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.