Currently submitted to: JMIR Formative Research
Date Submitted: May 9, 2026
Open Peer Review Period: Jun 4, 2026 - Jul 30, 2026
(currently open for review)
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
From GitHub to Clinic? Small-Scale Validation of Open-Source AI for Diabetic Retinopathy Screening
ABSTRACT
Background:
Accurate, timely detection of diabetic retinopathy (DR) remains challenging, especially in low‑resource settings with limited ophthalmologic expertise. Open‑source deep learning models offer accessible computer vision tools for DR screening, but their real‑world reliability and clinical readiness are not well established.
Objective:
This study evaluates two open‑source models—Proliferative Retinopathy Detection and Bhimrazy Diabetic‑Retinopathy‑Detection—on a publicly available Zenodo fundus image dataset to compare performance characteristics and explore potential clinical roles.
Methods:
We randomly sampled 100 fundus photographs (50 DR‑positive, 50 DR‑negative) from the Zenodo dataset. Each model generated binary predictions (DR vs. no DR) and confidence scores. Using custom Python scripts, we computed sensitivity, specificity, precision, F1 score, accuracy, and average confidence for correctly identified DR‑positive cases. McNemar's test was used to compare paired classification performance between models.
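The metric definitions named in the Methods can be sketched as follows. This is a minimal illustration, not the authors' custom scripts: the function and class names are hypothetical, labels are assumed to be coded 1 for DR and 0 for no DR, and the McNemar statistic is assumed to use the chi-square form with Edwards' continuity correction (the abstract does not say which variant was used).

```python
import math
from dataclasses import dataclass


@dataclass
class BinaryMetrics:
    sensitivity: float
    specificity: float
    precision: float
    f1: float
    accuracy: float


def _ratio(num, den):
    # Return NaN instead of raising when a denominator is empty.
    return num / den if den else float("nan")


def binary_metrics(y_true, y_pred):
    """Screening metrics from paired 0/1 labels (assumed: 1 = DR, 0 = no DR)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    return BinaryMetrics(
        sensitivity=_ratio(tp, tp + fn),
        specificity=_ratio(tn, tn + fp),
        precision=_ratio(tp, tp + fp),
        f1=_ratio(2 * tp, 2 * tp + fp + fn),
        accuracy=_ratio(tp + tn, len(pairs)),
    )


def mean_confidence_true_positives(y_true, y_pred, confidence):
    """Average confidence over correctly identified DR-positive images."""
    scores = [c for t, p, c in zip(y_true, y_pred, confidence)
              if t == 1 and p == 1]
    return _ratio(sum(scores), len(scores))


def mcnemar_test(y_true, pred_a, pred_b, continuity=True):
    """Chi-square McNemar test on the discordant pairs of two classifiers."""
    # b: model A correct, model B wrong; c: model A wrong, model B correct.
    b = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a == t and m != t)
    c = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a != t and m == t)
    if b + c == 0:
        return 0.0, 1.0  # the models never disagree on correctness
    diff = abs(b - c) - 1 if continuity else abs(b - c)
    stat = diff ** 2 / (b + c)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value
```

With only 100 images the number of discordant pairs may be small, in which case an exact binomial version of McNemar's test is often preferred over the chi-square approximation sketched here.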
Results:
The Proliferative model achieved an accuracy of 0.81, with a precision of 0.86 and a specificity of 0.88, indicating relatively few false positives and greater reliability in correctly classifying non‑DR images. Its F1 score was 0.794, with a mean confidence of 78.2% for correctly identified DR‑positive images. The Bhimrazy model demonstrated higher sensitivity (0.88), detecting more DR‑positive cases, but with lower specificity (0.80) and slightly lower precision (0.81), reflecting a modest increase in false positives; its F1 score and accuracy were 0.845 and 0.84, respectively. McNemar's test showed no statistically significant difference in overall classification performance between models (χ² = 0.64, p = 0.42).
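As a rough arithmetic check on the figures above, a 2×2 table can be rebuilt from a model's sensitivity and specificity, and the p-value can be recovered from the χ² statistic. The sketch below assumes that the 50/50 class split described in the Methods applies exactly, that every image received a prediction, and that the test statistic has 1 degree of freedom; the helper names are hypothetical.

```python
import math


def counts_from_rates(sensitivity, specificity, n_pos=50, n_neg=50):
    """Rebuild (TP, FN, TN, FP) from rates, assuming a known class split."""
    tp = round(sensitivity * n_pos)
    tn = round(specificity * n_neg)
    return tp, n_pos - tp, tn, n_neg - tn


def chi2_df1_pvalue(stat):
    """Upper-tail probability of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


# Bhimrazy model: sensitivity 0.88 and specificity 0.80 as reported in Results.
tp, fn, tn, fp = counts_from_rates(0.88, 0.80)
precision = tp / (tp + fp)
f1 = 2 * tp / (2 * tp + fp + fn)
accuracy = (tp + tn) / (tp + fn + tn + fp)

# McNemar statistic as reported in Results.
p_value = chi2_df1_pvalue(0.64)

print(f"TP={tp} FN={fn} TN={tn} FP={fp}")
print(f"precision={precision:.3f} f1={f1:.3f} accuracy={accuracy:.2f}")
print(f"p={p_value:.3f}")
```

Under these assumptions the reconstructed precision, accuracy, and p-value agree with the reported values after rounding, and the reconstructed F1 score (about 0.846) sits within 0.001 of the reported 0.845.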
Conclusions:
In this small, single‑dataset evaluation, the Proliferative model showed relatively higher specificity and precision, while the Bhimrazy model demonstrated relatively higher sensitivity, yielding complementary operating characteristics rather than clear superiority of one approach. These findings should be viewed as preliminary and hypothesis‑generating. Larger, multi‑center and prospective validations, including assessment across DR severity thresholds and comparison with ophthalmologist performance, are needed before considering any clinical deployment of these open‑source tools.
Clinical Trial:
N/A
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.