JMIR Preprints #89156: Comparing the Accuracy of ChatGPT-4o, DeepSeek V.3, and Gemini 2.5 Flash in Answering Frequently Asked Questions to Systemic Lupus Erythematosus (SLE)

Comparing the Accuracy of ChatGPT-4o, DeepSeek V.3, and Gemini 2.5 Flash in Answering Frequently Asked Questions to Systemic Lupus Erythematosus (SLE)

Alvina Widhani;
Suzy Maria;
Aisha Putri Chairani;
Nabila Y Adella Visco;
Muhammad Faiz Amirullah Nurhadi;
Lutfi Airlangga Harjoprawito;
Dwitya Elvira;
Irwin Tedja;
Yuniza Yuniza;
Deasy Fetarayani;
Sukamto Koesnoe;
Bramantya Wicaksana;
Anshari S Hasibuan;
Evy Yunihastuti

ABSTRACT

Background:

Systemic Lupus Erythematosus (SLE) is a complex, fluctuating disease, creating a continuous need for reliable patient information. A prior study concluded that SLE patients often turn to the internet, including AI chatbots, for information regarding SLE. The rise of AI chatbots as a primary information source presents a critical challenge regarding accuracy.

Objective:

This study aims to evaluate the performance of the latest generation of AI chatbots (ChatGPT-4o, DeepSeek V.3, and Gemini 2.5 Flash) in answering frequently asked questions about SLE.

Methods:

Twenty-two frequently asked questions (FAQs) about SLE, written in Bahasa Indonesia, were posed to each chatbot. Five clinical immunologists independently rated the accuracy of each response, blinded to its source, using a 4-point Likert scale. Readability was assessed using the Indonesian Flesch Reading Ease Score (FRES) formula. Accuracy and readability were compared across chatbots using the Kruskal-Wallis test or ANOVA, and Spearman's rho was used to assess correlations among accuracy, readability, and word count.
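As an illustration of the correlation step named above, the following is a minimal sketch of Spearman's rho, computed as the Pearson correlation of average ranks so that tied ratings are handled. It is not the authors' analysis code: the function names `average_ranks` and `spearman_rho` and the sample values are hypothetical and chosen only to show the calculation.

```python
from math import sqrt
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation applied to the ranks of x and y."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sqrt(sum((a - mx) ** 2 for a in rx))
    sy = sqrt(sum((b - my) ** 2 for b in ry))
    if sx == 0 or sy == 0:
        raise ValueError("rho is undefined when one variable is constant")
    return cov / (sx * sy)


# Hypothetical data, NOT taken from the study: word count per response and
# the corresponding mean expert rating on a Likert scale.
word_counts = [120, 180, 150, 300, 260, 210]
accuracy_scores = [1.0, 1.4, 1.2, 2.0, 1.4, 1.8]
rho = spearman_rho(word_counts, accuracy_scores)
```

Because Likert-based ratings produce many ties, the rank-based Pearson form is used here rather than the shortcut formula based on squared rank differences, which is exact only when there are no ties.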

Results:

Gemini 2.5 Flash demonstrated the highest accuracy, with a mean (SD) score of 1.25 (0.53), where lower scores indicate greater accuracy, significantly outperforming DeepSeek V.3 (1.48 [0.63]) and ChatGPT-4o (1.71 [0.61]) (p < .001). Gemini 2.5 Flash also scored significantly better across all four subdomains. Readability for all three chatbots was low (median FRES score: 42.22–46.66). Gemini 2.5 Flash produced the longest responses (8509 words total), followed by DeepSeek V.3 (5410 words) and ChatGPT-4o (3632 words). A weak but significant positive correlation was found between word count and lower accuracy (ρ = +0.375, p = .002).

Conclusions:

Gemini 2.5 Flash provided the most accurate responses to SLE-related questions, with consistent performance across all domains. However, its clinical utility, like that of ChatGPT-4o and DeepSeek V.3, is severely limited by low readability scores (FRES < 50), which make all three chatbots unsuitable for general patient use. This highlights a critical "blind spot": clinical accuracy, as rated by experts, does not equate to patient accessibility. Further research is required to assess the accuracy and readability of AI chatbots across different medical fields and topics.

Citation

Please cite as:

Widhani A, Maria S, Chairani AP, Visco NYA, Nurhadi MFA, Harjoprawito LA, Elvira D, Tedja I, Yuniza Y, Fetarayani D, Koesnoe S, Wicaksana B, Hasibuan AS, Yunihastuti E

Comparing the Accuracy of ChatGPT-4o, DeepSeek-V3, and Gemini 2.5 Flash in Answering Frequently Asked Questions About Systemic Lupus Erythematosus: Quantitative Study

JMIR Form Res 2026;10:e89156

DOI: 10.2196/89156

PMID: 42258421

PMCID: 13245641

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

Accepted for/Published in: JMIR Formative Research

Date Submitted: Dec 10, 2025

Date Accepted: May 8, 2026
