Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Apr 4, 2026
Date Accepted: May 12, 2026
Performance of Deep Learning in Classifying Age-related Macular Degeneration from Images: A Systematic Review and Meta-analysis
ABSTRACT
Background:
Age-related macular degeneration (AMD) is a leading cause of irreversible blindness worldwide. Manual grading is time-consuming and subject to interobserver variability, underscoring the need to assess automated artificial intelligence tools as potential diagnostic aids.
Objective:
This study compares the diagnostic performance of deep learning (DL) algorithms with that of senior and junior ophthalmologists in detecting AMD and in differentiating wet AMD (wAMD) from dry AMD (dAMD) using retinal images.
Methods:
A systematic search of PubMed, Embase, Web of Science, and the Cochrane Library was conducted through October 2025. Studies applying DL for AMD classification were included. Risk of bias was evaluated using PROBAST+AI. Pooled sensitivity, specificity, accuracy, and area under the curve (AUC) were calculated using bivariate random-effects models. The protocol was registered in PROSPERO (CRD420251243276).
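As a rough illustration of random-effects pooling and of the I² statistic reported with the pooled estimates, the following minimal sketch pools per-study sensitivities on the logit scale with a DerSimonian-Laird estimator. It is a univariate simplification and not the bivariate model used in this review, which models sensitivity and specificity jointly. The function name `pool_proportion`, the 0.5 continuity correction, and the example counts are assumptions made for illustration and are not taken from the included studies.

```python
import math


def pool_proportion(events, totals):
    """Pool proportions (e.g., per-study sensitivity) on the logit scale.

    Univariate DerSimonian-Laird random-effects sketch; a simplified
    stand-in for a bivariate model, not a replacement for it.
    """
    # Logit-transformed estimates and their approximate variances.
    # Assumption: a 0.5 continuity correction is applied to every study.
    y, v = [], []
    for e, n in zip(events, totals):
        a = e + 0.5          # corrected count of true positives
        b = n - e + 0.5      # corrected count of false negatives
        y.append(math.log(a / b))
        v.append(1 / a + 1 / b)

    # Fixed-effect weights and Cochran's Q.
    w = [1 / vi for vi in v]
    sw = sum(w)
    y_fixed = sum(wi * yi for wi, yi in zip(w, y)) / sw
    q = sum(wi * (yi - y_fixed) ** 2 for wi, yi in zip(w, y))
    df = len(y) - 1

    # Between-study variance (tau^2) and I^2 (percent), both floored at 0.
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0

    # Random-effects pooled estimate with a 95% Wald interval on the
    # logit scale, back-transformed to a proportion.
    w_re = [1 / (vi + tau2) for vi in v]
    mu = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    se = math.sqrt(1 / sum(w_re))

    def inv_logit(x):
        return 1 / (1 + math.exp(-x))

    return {
        "pooled": inv_logit(mu),
        "ci_low": inv_logit(mu - 1.96 * se),
        "ci_high": inv_logit(mu + 1.96 * se),
        "tau2": tau2,
        "i2": i2,
    }


# Hypothetical counts: true positives and diseased totals for three studies.
result = pool_proportion(events=[95, 180, 50], totals=[100, 200, 100])
print(result)
```

In practice, a dedicated diagnostic meta-analysis package would typically be used to fit the bivariate model; the sketch only shows how study weights, between-study variance, and I² relate to one another.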
Results:
Twenty-eight studies were included, comprising 77,485 samples for AMD detection and 28,705 samples for wAMD versus dAMD classification. For AMD detection, DL achieved a pooled sensitivity of 0.98 (95% CI, 0.96–0.99; I²=99.59%), specificity of 0.98 (95% CI, 0.95–0.99; I²=99.43%), accuracy of 0.97 (95% CI, 0.96–0.99; I²=96.0%), and AUC of 1.00 (95% CI, 0.99–1.00). Sensitivity and accuracy were significantly higher than those of senior ophthalmologists (both P<0.001). For wAMD versus dAMD, DL achieved a sensitivity of 0.95 (95% CI, 0.91–0.97; I²=97.90%), specificity of 0.95 (95% CI, 0.93–0.97; I²=90.13%), accuracy of 0.95 (95% CI, 0.92–0.97; I²=97.0%), and AUC of 0.99 (95% CI, 0.97–0.99). DL surpassed senior ophthalmologists in sensitivity (0.95 vs. 0.67; P=0.009) and outperformed junior ophthalmologists in specificity (0.95 vs. 0.53; P<0.001) and accuracy (0.95 vs. 0.75; P<0.001). Optical coherence tomography (OCT)–based models outperformed fundus photography–based models. Certainty of evidence was rated moderate because of risk of bias in the included studies.
Conclusions:
DL algorithms demonstrate robust diagnostic performance, mitigating the conservative bias of senior experts and the overdiagnosis tendency of junior practitioners. Although OCT-based models performed best among the imaging modalities, the substantial heterogeneity and retrospective designs of the included studies warrant caution. Large-scale, multicenter prospective trials are needed to bridge the generalization gap.
Clinical Trial:
The protocol was registered in PROSPERO (CRD420251243276).
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.