JMIR Preprints #77357: Performance of DeepSeek-R1, ChatGPT (GPT-o3-mini), and Gemini 2.0 Flash on German Medical Multiple-Choice Questions: Comparative Evaluation

Performance of DeepSeek-R1, ChatGPT (GPT-o3-mini), and Gemini 2.0 Flash on German Medical Multiple-Choice Questions: Comparative Evaluation

Annika Meyer;
Yassin Karay;
Andrea U. Steinbicker;
Thomas Streichert;
Remco Overbeek

ABSTRACT

Background:

Despite the transformative potential of artificial intelligence–based chatbots in medicine, their implementation is hindered by data privacy and security concerns. DeepSeek offers a conceivable solution through its capability for local offline operation. However, it remains unclear whether DeepSeek can achieve an accuracy comparable to that of conventional, cloud-based AI chatbots. This study aims to evaluate whether DeepSeek meets this essential criterion, thereby assessing its suitability for secure clinical use.

Objective:

To evaluate whether DeepSeek, an AI-based chatbot capable of offline operation, achieves diagnostic accuracy comparable to that of leading chatbots—ChatGPT and Gemini—on German medical multiple-choice questions, thereby assessing its potential as a privacy-preserving alternative for clinical use.

Methods:

We evaluated DeepSeek’s accuracy using 200 German medical multiple-choice questions and compared it with that of ChatGPT and Gemini. Differences in accuracy between chatbots, as well as across medical specialties and word count, were analyzed using McNemar’s test, Fisher’s exact test, and the Wilcoxon signed-rank test.
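
As a minimal sketch of the paired comparison named above, assuming each chatbot's answers are recorded as one correct/incorrect flag per question, an exact two-sided McNemar test can be computed from the discordant pairs alone. The function names and the six-question example data are hypothetical and serve only as an illustration; they are not the authors' code or results.

```python
from math import comb


def discordant_counts(correct_a, correct_b):
    """Count questions that exactly one of the two chatbots answered correctly."""
    pairs = list(zip(correct_a, correct_b))
    only_a = sum(1 for a, b in pairs if a and not b)
    only_b = sum(1 for a, b in pairs if b and not a)
    return only_a, only_b


def mcnemar_exact(only_a, only_b):
    """Two-sided exact McNemar p-value (binomial test on discordant pairs)."""
    n = only_a + only_b
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(only_a, only_b)
    # P(X <= k) for X ~ Binomial(n, 0.5), doubled for a two-sided test.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical answers for six questions (True = answered correctly).
model_a = [True, True, True, False, True, True]
model_b = [True, False, True, True, True, False]

only_a, only_b = discordant_counts(model_a, model_b)
p_value = mcnemar_exact(only_a, only_b)
print(f"discordant pairs: {only_a} vs {only_b}, p = {p_value:.3f}")
```

Only questions on which the two chatbots disagree enter the test, which is why a paired test is used here rather than a comparison of the two overall accuracy rates.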

Results:

All chatbots achieved accuracies between 93% and 96%, exceeding the conventional passing threshold of 60%. No significant differences were observed between the chatbots, either overall or in pairwise comparisons. However, for all examined chatbots, accuracy differed significantly with respect to word count.

Conclusions:

Overall, DeepSeek not only demonstrates outstanding performance on German medical multiple-choice questions, comparable to that of the widely used chatbots ChatGPT and Gemini, but also offers lower financial and environmental costs and the possibility of offline operation. Nevertheless, its adoption in the medical field remains restricted by hallucinations and biases, so critical examination of chatbot outputs is indispensable.

Clinical Trial:

NA

Citation

Please cite as:

Meyer A, Karay Y, Steinbicker AU, Streichert T, Overbeek R

Performance of DeepSeek-R1, ChatGPT (GPT-o3-mini), and Gemini 2.0 Flash on German Medical Multiple-Choice Questions: Comparative Evaluation

JMIR Form Res 2025;9:e77357

DOI: 10.2196/77357

PMID: 41411646

PMCID: 12757712

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

Accepted for/Published in: JMIR Formative Research

Date Submitted: May 12, 2025

Date Accepted: Oct 16, 2025
