Accepted for/Published in: JMIR Formative Research
Date Submitted: Aug 1, 2024
Date Accepted: May 15, 2025
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Does a complex prompt alter the diagnostic accuracy of common ophthalmological conditions by GPT-4? : Data Project
ABSTRACT
Background:
The global incidence of blindness has continued to increase, despite the enactment of a Global Eye Health Action Plan by the World Health Assembly. This can be attributed, in part, to an aging population, but also to the limited diagnostic resources within low- and middle-income countries (LMICs). The advent of artificial intelligence (AI) within healthcare could offer a novel solution for reducing the prevalence of blindness globally.
Objective:
The study aimed to establish whether a complex prompt altered the diagnostic accuracy of GPT-4 for common ophthalmological conditions and to quantify any differences in performance.
Methods:
Two AI models (gpt-4-0125-preview and an altered version of the Alan super prompt running on gpt-4-0125-preview) were instructed to diagnose the condition present in 12 clinical vignettes. The vignettes comprised five prevalent adult conditions, five prevalent childhood conditions and two control cases (one adult-orientated and one child-orientated). Through prompt engineering, the AI models were “forced” to provide only the name of the diagnosis. Each vignette was presented to each model 100 times. The data then underwent statistical analysis. A Chi-Square Test of Independence compared the total true positives for all the conditions between the two models. Additionally, statistical screening metrics (sensitivity, specificity, positive predictive value (PPV) and negative predictive value (NPV)) were used to determine the accuracy of each model.
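The analysis described above can be illustrated with a minimal Python sketch. It is not the authors' code: the names Confusion, screening_metrics and chi_square_independence are hypothetical, the counts in the demo are invented for illustration, and the contingency table layout (one row per model, one column per category) is an assumption. The sketch returns only the chi-square statistic and its degrees of freedom; obtaining a P value additionally requires the chi-square distribution, for example from a statistics library.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Confusion:
    """Counts for one condition and one model (hypothetical container)."""
    tp: int  # true positives
    fp: int  # false positives
    fn: int  # false negatives
    tn: int  # true negatives


def _ratio(numerator: int, denominator: int) -> Optional[float]:
    # Return None instead of raising when a metric is undefined.
    return numerator / denominator if denominator else None


def screening_metrics(c: Confusion) -> Dict[str, Optional[float]]:
    """Standard screening metrics computed from a 2 x 2 confusion table."""
    return {
        "sensitivity": _ratio(c.tp, c.tp + c.fn),
        "specificity": _ratio(c.tn, c.tn + c.fp),
        "ppv": _ratio(c.tp, c.tp + c.fp),
        "npv": _ratio(c.tn, c.tn + c.fn),
    }


def chi_square_independence(table: List[List[int]]) -> Tuple[float, int]:
    """Pearson chi-square statistic and degrees of freedom for an r x c table.

    Assumed layout: rows are models, columns are categories. No continuity
    correction is applied.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("every row and column needs a non-zero total")

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


if __name__ == "__main__":
    # Invented counts, for illustration only; not data from the study.
    print(screening_metrics(Confusion(tp=90, fp=5, fn=10, tn=95)))
    print(chi_square_independence([[10, 20], [20, 10]]))
```

Returning None for an undefined metric (for instance, PPV when a model never outputs a given diagnosis) keeps such cases visible instead of silently reporting zero.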
Results:
There was a significant difference between the AI models when analysing the total number of true positives for the conditions investigated (χ²=428.86, P=9.446e-87). The altered Alan super prompt outperformed gpt-4-0125-preview for all conditions except retinopathy of prematurity (ROP).
Conclusions:
The study established that, overall, the inclusion of a complex prompt positively affected the diagnostic accuracy of gpt-4-0125-preview. The greatest difference in the performance of the models was observed in conditions more prominent in LMICs. The results highlighted the potential impact that Alan could have on healthcare systems within LMICs as an augmentation of the medical diagnostic process. Although the altered Alan super prompt requires additional refinement, the implementation of AI applications in healthcare systems within LMICs could improve patient outcomes in these regions.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.