Accepted for/Published in: JMIR Medical Education
Date Submitted: Jul 28, 2025
Open Peer Review Period: Jul 28, 2025 - Sep 22, 2025
Date Accepted: Jan 9, 2026
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Performance Evaluation of Large Language Models in Multilingual Medical Multiple-Choice Questions: A Mixed Method Approach
ABSTRACT
Background:
Artificial intelligence continues to transform healthcare, offering promising applications in clinical practice and medical education. While large language models, as a form of generative artificial intelligence, have shown potential to match or surpass medical students in licensing examinations, their performance varies across languages. Recent studies highlight the complex influence and interdependency of factors such as language and model type on the accuracy of large language models, yet cross-language comparisons remain underexplored. Switzerland's multilingual medical licensing examination provides a unique opportunity to investigate these dynamics.
Objective:
This study evaluates the performance of large language models in Swiss medical multiple-choice questions across three languages, aiming to uncover model capabilities in a multilingual medical education context.
Methods:
For this study, 150 publicly accessible multilingual multiple-choice questions from an online self-assessment tool were selected and analysed. A mixed-methods approach combining quantitative and qualitative methods was implemented to evaluate the large language models' outputs. Several large language models developed by OpenAI, MetaAI, Anthropic, MistralAI, and DeepSeek were evaluated by prompting them to answer these questions in a text-only format.
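To illustrate what such a text-only prompting setup could look like, the sketch below sends one multiple-choice question to a chat-completion endpoint. The prompt wording, model name, and answer handling are assumptions for illustration only and do not reproduce the study's exact protocol.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_mcq(question: str, options: dict[str, str], model: str = "gpt-4o") -> str:
    """Send one multiple-choice question as plain text and return the raw reply."""
    option_lines = "\n".join(f"{key}) {text}" for key, text in options.items())
    prompt = (
        "Answer the following medical multiple-choice question. "
        "Reply with the letter of the single best answer.\n\n"
        f"{question}\n{option_lines}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

In a study like this one, such a function would be called once per question, per model, and per language, with the returned letter compared against the official answer key for the quantitative analysis and the full free-text reply retained for the qualitative analysis.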
Results:
The performance of large language models on medical questions varied by model and language. While most models answered the majority of multiple-choice questions correctly, accuracy differed across models. In the qualitative analysis, all models showed reasoning errors and sometimes struggled to identify the single best answer, despite demonstrating factual accuracy on the underlying topic.
Conclusions:
While our results are in line with previous demonstrations of the high potential of large language models in answering multilingual medical examination questions, this study highlights the importance of careful model selection, prompt design, and awareness of performance variability across languages. Ongoing evaluation and transparent reporting are needed to ensure the reliable integration of large language models into medical education contexts.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.