Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Dec 27, 2023
Open Peer Review Period: Jan 8, 2024 - Mar 4, 2024
Date Accepted: Mar 19, 2024
Leveraging Large Language Models for Improved Patient Access and Self-Management in Oral Healthcare: A Preclinical Study
ABSTRACT
Background:
While Large Language Models like ChatGPT and Google Bard have shown significant promise in various fields, their broader impact on enhancing patient healthcare access and quality, particularly in specialized domains like oral health, requires comprehensive evaluation.
Objective:
This study aims to assess the effectiveness of Google Bard, ChatGPT-3.5, and ChatGPT-4 in offering recommendations for common oral health issues, benchmarked against responses from human dental experts.
Methods:
This comparative analysis used forty questions derived from patient surveys on prevalent oral diseases, posed in a simulated clinical environment. Responses were obtained from both human experts and Large Language Models and were rated by experienced dentists on readability, appropriateness, harmlessness, and comprehensiveness, and by lay users on intent capture and helpfulness. Additionally, the stability of AI responses was assessed by submitting each question three times under consistent conditions; a simplified sketch of this repeated-query protocol follows.
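As a rough illustration only, the Python sketch below shows how such a repeated-submission protocol could be organized. The helper `query_model`, the data structures, and the summary format are hypothetical placeholders for this illustration and do not represent the authors' actual pipeline; only the three models, the three-trial repetition, and the mean ± SD reporting style are taken from the abstract.

```python
# Hypothetical sketch of a repeated-query stability protocol (not the authors' code).
from statistics import mean, stdev

MODELS = ["Google Bard", "ChatGPT-3.5", "ChatGPT-4"]
N_TRIALS = 3  # each question submitted three times under consistent conditions


def query_model(model: str, question: str) -> str:
    """Placeholder: send `question` to `model` and return its answer text."""
    raise NotImplementedError("wire up the vendor-specific API client here")


def collect_responses(questions: list[str]) -> dict[str, list[list[str]]]:
    """Gather N_TRIALS answers per question for every model, for later rating."""
    return {
        model: [[query_model(model, q) for _ in range(N_TRIALS)] for q in questions]
        for model in MODELS
    }


def summarize_ratings(ratings: list[float]) -> str:
    """Report a set of expert or lay ratings as mean ± SD, as in the Results."""
    return f"{mean(ratings):.2f} ± {stdev(ratings):.2f}"
```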
Results:
Google Bard exhibited the best readability among all groups but scored significantly lower than human experts in appropriateness (8.51 ± 0.37 vs. 9.60 ± 0.33, P = .034), whereas ChatGPT-3.5 and ChatGPT-4 performed comparably to human experts in appropriateness (8.96 ± 0.35 and 9.34 ± 0.47, respectively). All three Large Language Models received high harmlessness scores comparable to those of human experts. Lay users found no significant difference between Large Language Models and human experts in helpfulness or intent capture. The stability evaluation identified ChatGPT-4 as the most reliable model, with the most correct responses and the fewest incorrect and unreliable responses.
Conclusions:
Large Language Models, particularly ChatGPT-4, show potential in oral healthcare, providing patient-centric information that can enhance patient education and clinical care. The observed performance variations underscore the need for ongoing refinement and ethical oversight. Future research should focus on developing strategies for the safe integration of Large Language Models into healthcare settings. Clinical Trial: NA
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.