
Accepted for/Published in: JMIR Medical Education

Date Submitted: Jul 28, 2025
Open Peer Review Period: Jul 28, 2025 - Sep 22, 2025
Date Accepted: Jan 9, 2026

The final, peer-reviewed published version of this preprint can be found here:

Performance Evaluation of Large Language Models in Multilingual Medical Multiple-Choice Questions: Mixed Methods Study

Strasser LM, Anschuetz W, Dennstaedt F, Hasting J

Performance Evaluation of Large Language Models in Multilingual Medical Multiple-Choice Questions: Mixed Methods Study

JMIR Med Educ 2026;12:e81399

DOI: 10.2196/81399

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Performance Evaluation of Large Language Models in Multilingual Medical Multiple-Choice Questions: A Mixed Methods Approach

  • Livia Maria Strasser; 
  • Wilma Anschuetz; 
  • Fabio Dennstaedt; 
  • Janna Hasting

ABSTRACT

Background:

Artificial intelligence continues to transform healthcare, offering promising applications in clinical practice and medical education. While large language models as a form of generative artificial intelligence have shown potential to match or surpass medical students in licensing examinations, their performance varies across languages. Recent studies highlight the complex influence and interdependency of factors such as language and model type on large language models’ accuracy, yet cross-language comparisons remain underexplored. Switzerland’s multilingual medical licensing exam provides a unique opportunity to investigate these dynamics.

Objective:

This study evaluates the performance of large language models in Swiss medical multiple-choice questions across three languages, aiming to uncover model capabilities in a multilingual medical education context.

Methods:

For this study, 150 publicly accessible multilingual multiple-choice questions from an online self-assessment tool were selected and analysed. A mixed methods approach combining quantitative and qualitative analyses was used to evaluate the large language models' outputs. Several large language models developed by OpenAI, MetaAI, Anthropic, MistralAI, and DeepSeek were evaluated by prompting them to answer these questions in a text-only format.
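The quantitative part of an evaluation like the one described above reduces to comparing each model's chosen answer letter against an answer key and aggregating accuracy per language. The sketch below illustrates that scoring step only; the question IDs, language codes, and answers are illustrative placeholders, not data from this study, and the prompting of the models themselves is assumed to happen upstream.

```python
# Minimal sketch of the scoring step: compare predicted answer letters
# against an answer key and compute per-language accuracy.
# All IDs, languages, and answers below are hypothetical examples.

from collections import defaultdict

def accuracy_by_language(predictions, answer_key):
    """predictions: {(question_id, language): predicted_letter}
    answer_key: {question_id: correct_letter}
    Returns {language: fraction of questions answered correctly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for (qid, lang), pred in predictions.items():
        total[lang] += 1
        if pred == answer_key[qid]:
            correct[lang] += 1
    return {lang: correct[lang] / total[lang] for lang in total}

# Toy example: two questions answered in two languages
key = {"q1": "B", "q2": "D"}
preds = {
    ("q1", "de"): "B", ("q2", "de"): "D",  # both correct in German
    ("q1", "fr"): "B", ("q2", "fr"): "A",  # one wrong in French
}
print(accuracy_by_language(preds, key))  # {'de': 1.0, 'fr': 0.5}
```

In practice one such prediction table would be built per model, which is what allows accuracy to be broken down by both model and language as the Results section reports.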

Results:

The performance of large language models on medical questions varied by model and language. While most models answered most multiple-choice questions correctly, accuracy differed across models. In the qualitative analysis, all models showed reasoning errors and sometimes failed to identify the most correct answer even when they demonstrated factual accuracy on the topic in question.

Conclusions:

While our results are in line with previous demonstrations of the high potential of large language models in answering multilingual medical exam questions, this study highlights the importance of careful model selection, prompt design, and awareness of performance variability across languages. There is a need for ongoing evaluation as well as transparent reporting to ensure reliable integration of large language models into medical education contexts.


 Citation

Please cite as:

Strasser LM, Anschuetz W, Dennstaedt F, Hasting J

Performance Evaluation of Large Language Models in Multilingual Medical Multiple-Choice Questions: Mixed Methods Study

JMIR Med Educ 2026;12:e81399

DOI: 10.2196/81399


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.