Accepted for/Published in: JMIR Formative Research
Date Submitted: Sep 6, 2024
Date Accepted: Jan 29, 2025
Medical Misinformation in AI-Assisted Self-Diagnosis: The EvalPrompt Method for Analyzing Large Language Models
ABSTRACT
Background:
The rapid integration of Large Language Models (LLMs) into healthcare is sparking global discussion about their potential to revolutionize healthcare quality and accessibility. At a time when improving healthcare quality and access remains a critical concern for countries worldwide, the ability of these models to pass medical exams is often cited as a reason to use them for medical training and diagnosis. However, the impact of their inevitable use as a self-diagnostic tool and their role in spreading healthcare misinformation have not been evaluated.
Objective:
This study aims to assess the effectiveness of LLMs, particularly ChatGPT, from the perspective of individuals using them for self-diagnosis, in order to better understand the clarity, correctness, and robustness of the models' responses.
Methods:
We propose Evaluation of LLM Prompts (EvalPrompt), a comprehensive testing methodology that uses multiple-choice medical licensing examination questions to evaluate LLM responses. Experiment 1 prompts ChatGPT with open-ended questions to mimic real-world self-diagnosis use cases, and Experiment 2 performs sentence dropout on the correct responses from Experiment 1 to mimic self-diagnosis with missing information. Human assessors then rate the ChatGPT responses from both experiments to evaluate the clarity, correctness, and robustness of ChatGPT. A minimal sketch of the sentence-dropout perturbation is shown below.
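The following is a minimal sketch of the sentence-dropout step used in Experiment 2, for illustration only: the function name, drop rate, and sentence-splitting heuristic are assumptions and do not reflect the authors' actual implementation.

```python
# Illustrative sketch of sentence dropout on an LLM response (hypothetical code).
import random

def sentence_dropout(response: str, drop_rate: float = 0.25, seed: int = 0) -> str:
    """Remove a random subset of sentences from a response to
    simulate a self-diagnosis prompt with missing information."""
    random.seed(seed)
    # Naive sentence split on periods; a real pipeline would use a proper tokenizer.
    sentences = [s.strip() for s in response.split(".") if s.strip()]
    kept = [s for s in sentences if random.random() > drop_rate]
    # Keep at least one sentence so the perturbed response is never empty.
    if not kept:
        kept = sentences[:1]
    return ". ".join(kept) + "."

# Example: perturb a correct response before re-assessment by human raters.
original = ("The symptoms suggest iron-deficiency anemia. "
            "A complete blood count is recommended. "
            "Dietary changes may also help.")
print(sentence_dropout(original, drop_rate=0.3))
```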
Results:
In Experiment 1, ChatGPT-4.0 was deemed correct for 31% of the questions by both non-experts and experts, with only 34% agreement between the two groups. Similarly, in Experiment 2, which assessed robustness, 61% of the responses continued to be categorized as correct by all assessors. Relative to a passing threshold of 60%, ChatGPT-4.0 is therefore judged incorrect and unclear, though robust. This indicates that sole reliance on ChatGPT-4.0 for self-diagnosis could increase the risk of individuals being misinformed.
Conclusions:
The results highlight the modest capabilities of LLMs, as their responses are often unclear and inaccurate. Any medical advice provided by LLMs should be cautiously approached due to the significant risk of misinformation. However, evidence suggests that LLMs are steadily improving and could potentially play a role in healthcare systems in the future. To address the issue of medical misinformation, there is a pressing need for the development of a comprehensive self-diagnosis dataset. This dataset could enhance the reliability of LLMs in medical applications by featuring more realistic prompt styles with minimal information across a broader range of medical fields.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.