Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Nov 21, 2023
Date Accepted: Mar 22, 2024
Utility of Large Language Models for Healthcare Professionals and Patients in Navigating Hematopoietic Stem Cell Transplant: Comparison of ChatGPT-3.5, ChatGPT-4, and Bard Performance
ABSTRACT
Background:
Artificial intelligence is increasingly being applied to many workflows. Large language models (LLMs) are publicly accessible platforms trained to understand, interact with, and produce human-readable text; their ability to deliver relevant and reliable medical information is of particular interest to healthcare providers and patients alike. Hematopoietic stem cell transplant (HSCT) is a complex medical field that requires extensive knowledge, background, and training to practice successfully, and it can be challenging for nonspecialist audiences to comprehend.
Objective:
We aimed to test the applicability of three prominent LLMs, ChatGPT-3.5, ChatGPT-4, and Google Bard, in guiding nonspecialist healthcare professionals and advising patients seeking information on HSCT.
Methods:
We first submitted a large pool of open-ended HSCT-related questions of increasing difficulty to the LLMs and rated their responses on consistency (defined as replicability of the response when the same question was submitted multiple times), veracity, language comprehensibility, specificity to the topic, and presence of hallucinations. We then selected the two best-performing chatbots and rechallenged them by resubmitting the most difficult questions, prompting them to respond as if communicating with either a healthcare professional or a patient and to provide verifiable sources of information. These responses were rated again with the additional criterion of language appropriateness, defined as adaptation of the language to the intended audience, to evaluate each chatbot's ability to convey the same information in either plain or more technical terminology.
Results:
ChatGPT-4 outperformed both ChatGPT-3.5 and Google Bard in response consistency, response veracity, and specificity to the topic. Both ChatGPT-3.5 and ChatGPT-4 outperformed Google Bard in language comprehensibility. All chatbots displayed episodes of hallucination. ChatGPT-3.5 and ChatGPT-4 were then rechallenged with prompts to adapt their language to the audience and to provide sources of information, and their responses were rated again. ChatGPT-3.5 showed a better ability to adapt its language to a nonmedical audience, using a friendly tone and offering emotional support; however, both chatbots failed to provide correct and up-to-date information resources, returning out-of-date materials, incorrect URLs, or unfocused references and thereby rendering their output unverifiable by the reader.
Conclusions:
Despite the potential of LLMs to address challenging medical topics such as HSCT, the presence of errors and the lack of clear references make them not yet appropriate for routine, unsupervised clinical use or patient counseling. Enabling LLMs to access and reference current, regularly updated websites and research articles, as well as developing LLMs trained on specialized domain-knowledge datasets, may offer potential paths toward their future clinical application.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.