Accepted for/Published in: JMIR Formative Research
Date Submitted: Aug 14, 2023
Date Accepted: Dec 4, 2023
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Exploring the Potential of ChatGPT-4 in Predicting Refractive Surgery Categorizations: A Comparative Study
ABSTRACT
Background:
Refractive surgery research aims to optimize patient categorization for ideal procedures, minimizing risks while maximizing outcomes. Recent advances have led to the development of AI-powered algorithms, including machine learning (ML) approaches, to assess risks and enhance workflow. Large language models (LLMs) like ChatGPT-4 have emerged as potential general AI tools that can assist across various disciplines, including refractive surgery decision-making. However, their capabilities in pre-categorizing refractive surgery patients based on real-world parameters remain unexplored.
Objective:
This exploratory study aimed to examine ChatGPT-4's capabilities in pre-categorizing refractive surgery patients based on commonly used clinical parameters. The goal was to assess whether ChatGPT-4 could provide meaningful categorizations based on batch-processed inputs, comparable to those made by a refractive surgeon.
Methods:
Data from 100 consecutive patients from a refractive clinic were anonymized and analyzed. Parameters included age, sex, manifest refraction, visual acuity, and various corneal measurements and indices from Scheimpflug imaging. The study compared ChatGPT-4's performance with a clinician's categorizations using Cohen's Kappa coefficient, a confusion matrix, and descriptive statistics.
Results:
A statistically significant, greater-than-chance agreement was found between ChatGPT-4's and the clinician's categorizations, with a Cohen's kappa coefficient of 0.399 for six categories (confidence interval [0.256; 0.537]) and 0.610 for binary categorization (confidence interval [0.372; 0.792]). The model showed temporal instability and response variability. The chi-squared test showed significant differences in categorization distributions (Χ²=94.7, p<0.01), and Fisher's exact test for binary categorizations resulted in an odds ratio of 27.9 and a p-value of <0.01.
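The agreement statistics reported above can be reproduced in a few lines with standard Python libraries. The sketch below is illustrative only: the six category labels, the binary split, and the simulated rater data are hypothetical stand-ins, not the study's data, and the exact chi-squared setup (here a 2×6 contingency table of each rater's category counts) is an assumption about how the distributions were compared.

```python
# Illustrative sketch of the study's agreement statistics on
# simulated (hypothetical) clinician vs. model categorizations.
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix
from scipy.stats import chi2_contingency, fisher_exact

rng = np.random.default_rng(0)
categories = ["A", "B", "C", "D", "E", "F"]  # six hypothetical categories

# Simulate 100 patients: the model agrees with the clinician ~60% of the time.
clinician = rng.choice(categories, size=100)
model = np.where(rng.random(100) < 0.6,
                 clinician,
                 rng.choice(categories, size=100))

# Chance-corrected agreement across the six categories.
kappa = cohen_kappa_score(clinician, model)

# Compare the two raters' category distributions (2 x 6 count table).
counts = np.vstack([
    [np.sum(clinician == c) for c in categories],
    [np.sum(model == c) for c in categories],
])
chi2, p_chi2, _, _ = chi2_contingency(counts)

# Collapse to a hypothetical binary split (e.g. "candidate" vs. "not"),
# then run Fisher's exact test on the resulting 2 x 2 table.
clin_bin = np.isin(clinician, ["A", "B", "C"]).astype(int)
model_bin = np.isin(model, ["A", "B", "C"]).astype(int)
table = confusion_matrix(clin_bin, model_bin)
odds_ratio, p_fisher = fisher_exact(table)
```

With real paired categorizations in place of the simulated arrays, `kappa`, `chi2`, and `odds_ratio` correspond directly to the quantities quoted in the Results.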
Conclusions:
The study revealed that ChatGPT-4 exhibits potential as a pre-categorization tool in refractive surgery, showing promising agreement with clinician categorizations. However, limitations such as temporal instability and variability between iterations indicate room for improvement. The results encourage further exploration into the application of LLMs like ChatGPT-4 in healthcare, particularly in decision-making processes that require understanding vast clinical data. Future research should focus on refining the model's accuracy, expanding the variables used for classification, and exploring the boundaries of its limitations to pave the way for large-scale validation and real-world implementation.
Clinical Trial: none
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.