Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: May 5, 2024
Date Accepted: Oct 15, 2024
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Performance of Generative Artificial Intelligence Chatbots in Ophthalmic Registration and Clinical Diagnosis: A Cross-Sectional Study
ABSTRACT
Background:
Artificial intelligence (AI) chatbots such as ChatGPT are expected to have a significant impact on vision healthcare. Their potential to optimize the consultation process and their diagnostic capabilities across a range of ophthalmic sub-specialties remain to be fully explored.
Objective:
To investigate the performance of AI chatbots in recommending ophthalmic outpatient registration and in diagnosing eye diseases within clinical case profiles.
Methods:
This cross-sectional study utilized clinical cases from the Chinese Standardized Resident Training (SRT) - Ophthalmology (2nd Edition). For each case, two profiles were created: “Patient with History” (Hx) and “Patient with History and Examination” (Hx and Ex). These profiles served as independent queries for ChatGPT-3.5 and ChatGPT-4.0 (accessed March 5 to 18, 2024). The same profiles were posed to three ophthalmic residents in a questionnaire format. The accuracy of recommended ophthalmic sub-specialty registration was evaluated primarily using the “Hx” profiles. The accuracy of the top-ranked diagnosis and the accuracy of a correct diagnosis appearing within the top three suggestions (do-not-miss diagnoses) were assessed using the “Hx and Ex” profiles. The published official diagnosis served as the gold standard. Characteristics of incorrect diagnoses made by ChatGPT were also analyzed.
Results:
A total of 208 clinical profiles from 12 ophthalmic sub-specialties were analyzed (104 “Hx” and 104 “Hx and Ex”). For “Hx” cases, GPT-3.5, GPT-4.0, and residents showed comparable accuracy in registration suggestions (63.5%, 77.9%, and 69.2%, respectively; P = 0.073), with ocular trauma, retinal diseases, and strabismus and amblyopia showing the three highest accuracy rates. For “Hx and Ex” cases, both GPT-4.0 and residents demonstrated higher diagnostic accuracy than GPT-3.5 (59.6% and 60.6% vs. 39.4%; P = 0.003 and P = 0.001). Accuracy for “do-not-miss” diagnoses was also higher (76.0% and 65.4% vs. 49.0%; P < 0.001 and P = 0.015). The highest diagnostic accuracy was observed in glaucoma, lens diseases, and eyelid/lacrimal/orbital diseases. Compared with GPT-3.5, GPT-4.0 produced fewer incorrect top-3 diagnoses (59.5% vs. 84.1%, P = 0.005) and more partially correct diagnoses (50% vs. 11.1%, P < 0.001), whereas GPT-3.5 produced more completely incorrect diagnoses (42.9% vs. 16.7%, P = 0.005) and more imprecise diagnoses (34.9% vs. 11.9%, P = 0.009).
Conclusions:
GPT-3.5 and GPT-4.0 showed intermediate performance in recommending ophthalmic sub-specialties for registration. While GPT-3.5 underperformed, GPT-4.0 approached, and on some measures numerically surpassed, resident performance in differential diagnosis. AI chatbots show promise in facilitating ophthalmic patient registration, but their integration into diagnostic decision-making requires further validation.