Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: May 5, 2025
Date Accepted: Sep 15, 2025

The final, peer-reviewed published version of this preprint can be found here:

Zhang Z, Zhang H, Pan Z, Bi Z, Wan Y, Song X, Fan X. Evaluating Large Language Models in Ophthalmology: Systematic Review. J Med Internet Res 2025;27:e76947. DOI: 10.2196/76947. PMID: 41144954. PMCID: 12603593

Evaluating Large Language Models in Ophthalmology: A Systematic Review

  • Zili Zhang
  • Haiyang Zhang
  • Zhe Pan
  • Zhangqian Bi
  • Yao Wan
  • Xuefei Song
  • Xianqun Fan

ABSTRACT

Background:

Large language models (LLMs) have the potential to revolutionize ophthalmic care, but evaluations of their performance remain fragmented. A systematic assessment is crucial to identify gaps and guide future research and clinical integration.

Objective:

This systematic review aims to characterize the current landscape of LLM performance evaluations in ophthalmology, focusing on model usage, data modalities, ophthalmic subspecialties, medical tasks, evaluation dimensions, and clinical alignment.

Methods:

A comprehensive search of PubMed, Web of Science, Embase, and IEEE Xplore was conducted up to November 2024, identifying 817 unique publications. After screening, 165 peer-reviewed studies and preprints that quantitatively evaluated existing LLMs on ophthalmic tasks were included. Data extraction, categorization, statistical analysis, and visualization were performed systematically in Python.
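
As an illustration only (not the authors' published pipeline), the minimal Python sketch below shows how category statistics like those in the Results might be tabulated and visualized; the file name included_studies.csv and all column names are hypothetical assumptions.

    import pandas as pd
    import matplotlib.pyplot as plt

    # One row per included study; columns encode the extracted categories.
    df = pd.read_csv("included_studies.csv")

    # Studies often evaluate several models, so model-usage shares are
    # computed from overlapping boolean indicator columns (hypothetical
    # names) and may sum to more than 100%.
    model_cols = ["uses_chatgpt", "uses_gemini", "uses_copilot", "uses_llama"]
    model_pct = df[model_cols].mean() * 100  # share of studies per model

    # A single-valued field such as subspecialty partitions the studies,
    # so its shares sum to 100%.
    subspecialty_pct = df["subspecialty"].value_counts(normalize=True) * 100

    subspecialty_pct.plot(kind="barh")
    plt.xlabel("Share of included studies (%)")
    plt.title("Subspecialty coverage across included studies")
    plt.tight_layout()
    plt.show()

Under these assumptions, the overlapping indicator columns yield model-usage shares of the kind reported below, while the single-valued subspecialty column yields a breakdown that sums to 100%.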

Results:

The review revealed a heavy reliance on closed-source LLMs (ChatGPT: 98.18%, Gemini: 34.55%, Copilot: 19.39%; shares sum to more than 100% because many studies evaluated multiple models), with minimal attention to open-source alternatives (LLaMA: 3.03%). Modality-wise, 92.12% of studies focused on text-only tasks, while only 7.88% incorporated image-text evaluations, despite the centrality of imaging in ophthalmology. Subspecialty coverage was highly imbalanced: comprehensive ophthalmology (35.76%), retina & vitreous (16.36%), and glaucoma (11.52%) dominated, while ocular pathology & oncology and ophthalmic pharmacology were largely neglected. Medical tasks primarily comprised medical queries (43.03%), standardized examinations (23.03%), and diagnosis formulation (13.94%), with limited exploration of triaging (4.24%) and disease prediction (2.42%). Accuracy (93.33%) was the predominant evaluation metric, while calibration and uncertainty were rarely addressed (2.42%). Clinically, real-world patient data usage (19.39%), non-English evaluations (4.85%), and in-clinic deployment (1.21%) remained critically understudied, highlighting a significant translational gap.

Conclusions:

This review highlights significant gaps in LLM evaluations, including uneven subspecialty coverage, limited multimodal assessments, and insufficient real-world clinical testing. Future research should prioritize standardized frameworks, unified benchmarks, and comprehensive real-world evaluations to ensure LLMs' safe and effective integration into ophthalmic practice, ultimately improving patient outcomes.




© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.