Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: May 5, 2025
Date Accepted: Sep 15, 2025

The final, peer-reviewed published version of this preprint can be found here:

Zhang Z, Zhang H, Pan Z, Bi Z, Wan Y, Song X, Fan X. Evaluating Large Language Models in Ophthalmology: Systematic Review. J Med Internet Res 2025;27:e76947. DOI: 10.2196/76947. PMID: 41144954. PMCID: 12603593

Evaluating Large Language Models in Ophthalmology: A Systematic Review

  • Zili Zhang
  • Haiyang Zhang
  • Zhe Pan
  • Zhangqian Bi
  • Yao Wan
  • Xuefei Song
  • Xianqun Fan

ABSTRACT

Background:

Large language models (LLMs) have the potential to revolutionize ophthalmic care, but evaluations of their performance remain fragmented. A systematic assessment is crucial to identify gaps and guide future research and clinical integration.

Objective:

This systematic review aims to characterize the current landscape of LLM performance evaluations in ophthalmology, focusing on model usage, data modalities, ophthalmic subspecialties, medical tasks, evaluation dimensions, and clinical alignment.

Methods:

A comprehensive search of PubMed, Web of Science, Embase, and IEEE Xplore was conducted up to November 2024, identifying 817 unique publications. After screening, 165 peer-reviewed studies and preprints that quantitatively evaluated existing LLMs on ophthalmic tasks were included. Data extraction, categorization, statistical analysis, and visualization were performed systematically in Python.
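
As an illustration only (not the authors' published pipeline), the minimal Python sketch below shows how category statistics like those in the Results might be tabulated and visualized; the file name included_studies.csv and all column names are hypothetical assumptions.

    import pandas as pd
    import matplotlib.pyplot as plt

    # One row per included study; columns encode the extracted categories.
    df = pd.read_csv("included_studies.csv")

    # Studies often evaluate several models, so model-usage shares are
    # computed from overlapping boolean indicator columns (hypothetical
    # names) and may sum to more than 100%.
    model_cols = ["uses_chatgpt", "uses_gemini", "uses_copilot", "uses_llama"]
    model_pct = df[model_cols].mean() * 100  # share of studies per model

    # A single-valued field such as subspecialty partitions the studies,
    # so its shares sum to 100%.
    subspecialty_pct = df["subspecialty"].value_counts(normalize=True) * 100

    subspecialty_pct.plot(kind="barh")
    plt.xlabel("Share of included studies (%)")
    plt.title("Subspecialty coverage across included studies")
    plt.tight_layout()
    plt.show()

Under these assumptions, the overlapping indicator columns yield model-usage shares of the kind reported below, while the single-valued subspecialty column yields a breakdown that sums to 100%.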

Results:

The review revealed a heavy reliance on closed-source LLMs (ChatGPT: 98.18%, Gemini: 34.55%, Copilot: 19.39%; shares sum to more than 100% because many studies evaluated multiple models), with minimal attention to open-source alternatives (LLaMA: 3.03%). Modality-wise, 92.12% of studies focused on text-only tasks, while only 7.88% incorporated image-text evaluations, despite the centrality of imaging in ophthalmology. Subspecialty coverage was highly imbalanced: comprehensive ophthalmology (35.76%), retina & vitreous (16.36%), and glaucoma (11.52%) dominated, while ocular pathology & oncology and ophthalmic pharmacology were largely neglected. Medical tasks primarily comprised medical queries (43.03%), standardized examinations (23.03%), and diagnosis formulation (13.94%), with limited exploration of triaging (4.24%) and disease prediction (2.42%). Accuracy (93.33%) was the predominant evaluation metric, while calibration and uncertainty were rarely addressed (2.42%). Clinically, real-world patient data usage (19.39%), non-English evaluations (4.85%), and in-clinic deployment (1.21%) remained critically understudied, highlighting a significant translational gap.

Conclusions:

This review highlights significant gaps in LLM evaluations, including uneven subspecialty coverage, limited multimodal assessments, and insufficient real-world clinical testing. Future research should prioritize standardized frameworks, unified benchmarks, and comprehensive real-world evaluations to ensure LLMs' safe and effective integration into ophthalmic practice, ultimately improving patient outcomes.




© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.