Accepted for/Published in: JMIR Formative Research
Date Submitted: May 12, 2025
Date Accepted: Oct 16, 2025
Performance of DeepSeek-R1, ChatGPT (GPT-o3-mini), and Gemini 2.0 Flash on German Medical Multiple-Choice Questions: Comparative Evaluation
ABSTRACT
Background:
Despite the transformative potential of artificial intelligence–based chatbots in medicine, their implementation is hindered by data privacy and security concerns. DeepSeek offers a conceivable solution through its capability for local offline operation. However, it remains unclear whether DeepSeek can achieve an accuracy comparable to that of conventional, cloud-based AI chatbots. This study aims to evaluate whether DeepSeek meets this essential criterion, thereby assessing its suitability for secure clinical use.
Objective:
To evaluate whether DeepSeek, an AI-based chatbot capable of offline operation, achieves diagnostic accuracy comparable to that of leading chatbots—ChatGPT and Gemini—on German medical multiple-choice questions, thereby assessing its potential as a privacy-preserving alternative for clinical use.
Methods:
We evaluated DeepSeek’s accuracy on 200 German medical multiple-choice questions and compared it with that of ChatGPT and Gemini. Differences in overall accuracy, as well as in accuracy across medical specialties and by word count, were analyzed using McNemar’s test, Fisher’s exact test, and the Wilcoxon signed-rank test.
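As an illustration of the paired comparison named above, the following minimal sketch computes a two-sided exact McNemar test on per-question correct/incorrect outcomes for two chatbots. It is not the authors' analysis code: the function name `mcnemar_exact`, the choice of the exact (binomial) variant over the chi-square approximation, and the 20-question toy data are all assumptions made for illustration only.

```python
from math import comb


def mcnemar_exact(a_correct, b_correct):
    """Two-sided exact McNemar test on paired correct/incorrect outcomes.

    Hypothetical helper: returns (b, c, p_value), where b counts questions
    only chatbot A answered correctly and c counts questions only chatbot B
    answered correctly.
    """
    if len(a_correct) != len(b_correct):
        raise ValueError("paired inputs must have equal length")
    # Only discordant pairs carry information about a difference in accuracy.
    b = sum(1 for x, y in zip(a_correct, b_correct) if x and not y)
    c = sum(1 for x, y in zip(a_correct, b_correct) if y and not x)
    n = b + c
    if n == 0:
        return b, c, 1.0
    k = min(b, c)
    # Under H0 each discordant pair favors A or B with probability 0.5,
    # so the p-value is twice the lower binomial tail, capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy data (invented, NOT the study's results): 20 questions,
# chatbot A misses questions 0-1, chatbot B misses questions 1-5.
a = [i not in {0, 1} for i in range(20)]
b_ = [i not in {1, 2, 3, 4, 5} for i in range(20)]
only_a, only_b, p_value = mcnemar_exact(a, b_)
print(only_a, only_b, p_value)
```

Because the test uses only the questions on which the two chatbots disagree, two models with high and similar accuracy typically yield few discordant pairs and therefore a large p-value.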
Results:
All chatbots achieved an accuracy ranging from 93% to 96%, exceeding the conventional passing threshold of 60%. No significant differences were observed between the chatbots, either overall or in pairwise comparisons. However, for all examined chatbots, accuracy differed significantly with respect to word count.
Conclusions:
Overall, DeepSeek not only demonstrates outstanding performance on German medical multiple-choice questions, comparable to that of the widely used chatbots ChatGPT and Gemini, but also offers lower financial and environmental costs as well as the possibility of offline operation. Nevertheless, the adoption of DeepSeek in the medical field remains restricted by hallucinations and biases, so critical examination of chatbot outputs is indispensable.
Clinical Trial: NA
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.