Accepted for/Published in: JMIR AI
Date Submitted: Apr 24, 2025
Open Peer Review Period: Apr 22, 2025 - Jun 17, 2025
Date Accepted: Oct 1, 2025
Detection of Medical Misinformation in Hemangioma Patient Education: A Comparative Study of ChatGPT-4o and DeepSeek R1 Large Language Models
ABSTRACT
Background:
This study examines the capability of large language models (LLMs) to detect medical rumors, using hemangioma-related information as a test case, and compares the performance of ChatGPT-4o and DeepSeek R1.
Objective:
The objective of this study is to evaluate and compare the accuracy, stability, and expert-rated reliability of two large language models, ChatGPT-4o and DeepSeek R1, in classifying medical information related to hemangiomas as either "rumors" or "accurate information."
Methods:
The data were drawn from social media, medical-education websites, and international medical guidelines and were labeled by medical experts as either "rumors" or "accurate information." Each item was input into ChatGPT-4o and DeepSeek R1, producing two rounds of classification results with explanations. A BERT model was used to compute the semantic similarity between the two rounds of output as a measure of stability, while confusion matrices and the derived metrics (accuracy, precision, recall, and F1-score) quantified classification performance. Expert ratings of the model outputs were then compared using independent-samples t tests.
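The two quantitative steps described above, confusion-matrix metrics and semantic-similarity scoring, can be sketched in plain Python. This is a minimal illustration, not the study's pipeline: the labels and embedding vectors below are hypothetical toys, and in the actual study the vectors would come from a BERT encoder applied to the two rounds of model output.

```python
import math

# Binary rumor classifier metrics from a confusion matrix.
# Hypothetical convention: 1 = "rumor", 0 = "accurate information".
def classification_metrics(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Cosine similarity between two embedding vectors, the usual way
# round-to-round semantic similarity is scored from BERT embeddings.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy expert labels vs. model predictions (hypothetical data).
y_true = [1, 1, 1, 0, 0, 1, 0, 1, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 1, 1, 0, 1]
print(classification_metrics(y_true, y_pred))

# Toy embeddings for two rounds of output on the same item.
round1 = [0.2, 0.8, 0.1]
round2 = [0.25, 0.75, 0.15]
print(cosine_similarity(round1, round2))
```

In the study itself, the reported stability scores (e.g. 0.900 ± 0.025 for ChatGPT-4o) would be the mean and standard deviation of such per-item similarities across the dataset.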
Results:
DeepSeek R1 achieved a classification accuracy of 0.963, surpassing ChatGPT-4o's 0.910. Its precision (0.978 vs. 0.925), recall (0.957 vs. 0.908), and F1-score (0.967 vs. 0.916) were all superior. Expert ratings averaged higher for DeepSeek R1 (4.35–4.48) than for ChatGPT-4o (4.04–4.07), but the difference in scores between the two models was not statistically significant for either the "rumor" or the "accurate information" category (P > .05). Semantic similarity analysis indicated comparable stability between the models (ChatGPT-4o: 0.900 ± 0.025; DeepSeek R1: 0.897 ± 0.032). A case comparison showed that DeepSeek R1 could decisively refute rumors (e.g., "sun exposure aggravates hemangiomas"), whereas ChatGPT-4o, owing to its cautious style, often failed to explicitly reject incorrect claims.
Conclusions:
DeepSeek R1 demonstrates greater accuracy and more decisive reasoning than ChatGPT-4o in detecting medical rumors. This study provides empirical support for applying large language models to misinformation detection and recommends further optimizing model accuracy and incorporating real-time verification mechanisms to mitigate the harmful impact of misleading information on patient health.
Citation
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.