Accepted for/Published in: JMIR Formative Research
Date Submitted: Sep 6, 2024
Date Accepted: Jan 29, 2025
Medical Misinformation in AI-Assisted Self-Diagnosis: The EvalPrompt Method for Analyzing Large Language Models
ABSTRACT
Background:
The rapid integration of Large Language Models (LLMs) into healthcare is sparking global discussion about their potential to revolutionize healthcare quality and accessibility. At a time when improving healthcare quality and access remains a critical concern for countries worldwide, the ability of these models to pass medical exams is often cited as a reason to use them for medical training and diagnosis. However, the impact of their inevitable use as a self-diagnostic tool and their role in spreading healthcare misinformation have not been evaluated.
Objective:
This study aims to assess the effectiveness of LLMs, particularly ChatGPT, from the perspective of individuals using them for self-diagnosis, in order to better understand the clarity, correctness, and robustness of the models' responses.
Methods:
We propose Evaluation of LLM Prompts (EvalPrompt), a comprehensive testing methodology that uses multiple-choice medical licensing examination questions to evaluate LLM responses. Experiment 1 prompts ChatGPT with open-ended questions to mimic real-world self-diagnosis use cases, and Experiment 2 performs sentence dropout on the correct responses from Experiment 1 to mimic self-diagnosis with missing information. Human assessors then rate the ChatGPT responses from both experiments to evaluate the clarity, correctness, and robustness of ChatGPT. A minimal sketch of the sentence-dropout perturbation is shown below.
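The following is a minimal sketch of the sentence-dropout step used in Experiment 2, for illustration only: the function name, drop rate, and sentence-splitting heuristic are assumptions and do not reflect the authors' actual implementation.

```python
# Illustrative sketch of sentence dropout on an LLM response (hypothetical code).
import random

def sentence_dropout(response: str, drop_rate: float = 0.25, seed: int = 0) -> str:
    """Remove a random subset of sentences from a response to
    simulate a self-diagnosis prompt with missing information."""
    random.seed(seed)
    # Naive sentence split on periods; a real pipeline would use a proper tokenizer.
    sentences = [s.strip() for s in response.split(".") if s.strip()]
    kept = [s for s in sentences if random.random() > drop_rate]
    # Keep at least one sentence so the perturbed response is never empty.
    if not kept:
        kept = sentences[:1]
    return ". ".join(kept) + "."

# Example: perturb a correct response before re-assessment by human raters.
original = ("The symptoms suggest iron-deficiency anemia. "
            "A complete blood count is recommended. "
            "Dietary changes may also help.")
print(sentence_dropout(original, drop_rate=0.3))
```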
Results:
In Experiment 1, ChatGPT-4.0 was deemed correct for 31% of the questions by both non-experts and experts, with only 34% agreement between the two groups. Similarly, in Experiment 2, which assessed robustness, 61% of the responses continued to be categorized as correct by all assessors. Relative to a passing threshold of 60%, ChatGPT-4.0 is therefore judged incorrect and unclear, though robust. This indicates that sole reliance on ChatGPT-4.0 for self-diagnosis could increase the risk of individuals being misinformed.
Conclusions:
The results highlight the modest capabilities of LLMs, as their responses are often unclear and inaccurate. Any medical advice provided by LLMs should be cautiously approached due to the significant risk of misinformation. However, evidence suggests that LLMs are steadily improving and could potentially play a role in healthcare systems in the future. To address the issue of medical misinformation, there is a pressing need for the development of a comprehensive self-diagnosis dataset. This dataset could enhance the reliability of LLMs in medical applications by featuring more realistic prompt styles with minimal information across a broader range of medical fields.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.