
Accepted for/Published in: JMIR Formative Research

Date Submitted: Sep 6, 2024
Date Accepted: Jan 29, 2025

The final, peer-reviewed published version of this preprint can be found here:

Medical Misinformation in AI-Assisted Self-Diagnosis: Development of a Method (EvalPrompt) for Analyzing Large Language Models

Zada T, Tam N, Barnard F, Van Sittert M, Bhat V, Rambhatla S

JMIR Form Res 2025;9:e66207

DOI: 10.2196/66207

PMID: 40063849

PMCID: 11913316

Medical Misinformation in AI-Assisted Self-Diagnosis: The EvalPrompt Method for Analyzing Large Language Models

  • Troy Zada; 
  • Natalie Tam; 
  • Francois Barnard; 
  • Marlize Van Sittert; 
  • Venkat Bhat; 
  • Sirisha Rambhatla

ABSTRACT

Background:

The rapid integration of large language models (LLMs) into healthcare is sparking global discussion about their potential to revolutionize healthcare quality and accessibility. At a time when improving healthcare quality and access remains a critical concern for countries worldwide, the ability of these models to pass medical exams is often cited as a reason to use them for medical training and diagnosis. However, the impact of their inevitable use as a self-diagnostic tool and their role in spreading healthcare misinformation have not been evaluated.

Objective:

This study aims to assess the effectiveness of LLMs, particularly ChatGPT, from the perspective of an individual using them to self-diagnose, in order to better understand the clarity, correctness, and robustness of the models' responses.

Methods:

We propose Evaluation of LLM Prompts (EvalPrompt), a comprehensive methodology that uses multiple-choice medical licensing exam questions to evaluate LLM responses. Experiment 1 prompts ChatGPT with the questions rephrased as open-ended queries to mimic real-world self-diagnosis use cases, and Experiment 2 applies sentence dropout to the correct responses from Experiment 1 to mimic self-diagnosis with missing information. Human assessors then rate the responses returned by ChatGPT in both experiments to evaluate its clarity, correctness, and robustness.
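
The paper does not reproduce its dropout implementation here; as a rough illustrative sketch only, a sentence-dropout step of the kind Experiment 2 describes (assuming a naive period-based sentence split, which a real pipeline would replace with a proper sentence tokenizer) might look like:

```python
import random

def sentence_dropout(response: str, drop_rate: float = 0.25, seed: int = 0) -> str:
    """Remove a random subset of sentences from a prompt or response to
    simulate a user supplying incomplete information."""
    rng = random.Random(seed)  # seeded for reproducible experiments
    # Naive split on '. '; a real pipeline would use a sentence tokenizer.
    sentences = [s.strip() for s in response.split(". ") if s.strip()]
    kept = [s for s in sentences if rng.random() > drop_rate]
    # Always keep at least one sentence so the degraded prompt is non-empty.
    if not kept:
        kept = [sentences[0]]
    return ". ".join(kept)
```

The degraded text can then be fed back to the model to check whether its answer remains correct, which is the robustness question Experiment 2 probes.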

Results:

In Experiment 1, both nonexperts and experts deemed ChatGPT-4.0 correct for 31% of the questions, with only 34% agreement between the two groups. In Experiment 2, which assessed robustness, 61% of the responses were still categorized as correct by all assessors. Measured against a passing threshold of 60%, ChatGPT-4.0 is therefore judged incorrect and unclear, though robust. This indicates that sole reliance on ChatGPT-4.0 for self-diagnosis could increase the risk of individuals being misinformed.
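
The 34% figure is a raw agreement rate between the two assessor groups. As a minimal sketch (not the paper's code, and assuming simple per-item label matching rather than a chance-corrected statistic such as Cohen's kappa), it can be computed as:

```python
def percent_agreement(labels_a: list[str], labels_b: list[str]) -> float:
    """Fraction of items on which two assessor groups assign the same label."""
    if len(labels_a) != len(labels_b):
        raise ValueError("label lists must be the same length")
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)
```

A chance-corrected measure would give a more conservative picture of expert/nonexpert consistency, but raw agreement matches the percentage reported here.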

Conclusions:

The results highlight the modest capabilities of LLMs, whose responses are often unclear and inaccurate. Any medical advice provided by LLMs should be approached with caution given the significant risk of misinformation. However, evidence suggests that LLMs are steadily improving and could play a role in healthcare systems in the future. To address the issue of medical misinformation, there is a pressing need for a comprehensive self-diagnosis dataset featuring more realistic prompt styles with minimal information across a broader range of medical fields; such a dataset could enhance the reliability of LLMs in medical applications.



© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft other than for review purposes.