Accepted for/Published in: JMIR Medical Education

Date Submitted: Jul 28, 2025
Open Peer Review Period: Jul 28, 2025 - Sep 22, 2025
Date Accepted: Jan 9, 2026

The final, peer-reviewed published version of this preprint can be found here:

Performance Evaluation of Large Language Models in Multilingual Medical Multiple-Choice Questions: Mixed Methods Study

Performance Evaluation of Large Language Models in Multilingual Medical Multiple-Choice Questions: A Mixed Method Approach

  • Livia Maria Strasser; 
  • Wilma Anschuetz; 
  • Fabio Dennstädt; 
  • Janna Hastings

ABSTRACT

Background:

Artificial intelligence continues to transform healthcare, offering promising applications in clinical practice and medical education. While large language models, a form of generative artificial intelligence, have shown potential to match or surpass medical students in licensing examinations, their performance varies across languages. Recent studies highlight the complex, interdependent influence of factors such as language and model type on the accuracy of large language models, yet cross-language comparisons remain underexplored.

Objective:

This study evaluates the performance of large language models in answering medical multiple-choice questions quantitatively and qualitatively across three languages (German, French, and Italian), aiming to uncover model capabilities in a multilingual medical education context.

Methods:

For this mixed methods study, 114 publicly accessible multiple-choice questions in German, French, and Italian from an online self-assessment tool were analysed. A quantitative performance analysis of several large language models developed by OpenAI, Meta AI, Anthropic, and DeepSeek was conducted to evaluate their performance in answering the questions in text-only format. For the comparative analysis, both the input question language (German, French, or Italian) and the prompt language (English versus language-matched) were varied. The two best-performing large language models were then prompted to provide answer explanations for incorrectly answered questions, and a subsequent qualitative analysis of these explanations identified the reasons leading to the incorrect answers.
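
To make the comparison protocol concrete, the sketch below outlines how such an evaluation loop could be implemented. It is a minimal illustration under assumed names only: the model list, the prompt templates, and the query_model helper are hypothetical placeholders, not the authors' actual pipeline.

```python
# Minimal sketch of the evaluation protocol described above.
# NOTE: query_model(), the prompt templates, and the model list are
# hypothetical placeholders, not the authors' actual pipeline.
from collections import defaultdict

QUESTION_LANGS = ["de", "fr", "it"]  # German, French, Italian
PROMPT_LANGS = ["en", "matched"]     # English vs. language-matched prompt
MODELS = ["model-a", "model-b"]      # stand-ins for the evaluated LLMs

PROMPTS = {
    "en": "Answer the following multiple-choice question with a single letter.",
    "de": "Beantworten Sie die folgende Multiple-Choice-Frage mit einem Buchstaben.",
    "fr": "Répondez à la question à choix multiple suivante par une seule lettre.",
    "it": "Rispondi alla seguente domanda a scelta multipla con una sola lettera.",
}

def query_model(model: str, prompt: str, question: str) -> str:
    """Placeholder for the API call returning the model's chosen option."""
    raise NotImplementedError

def evaluate(questions: dict[str, list[dict]]) -> dict:
    """questions maps a language code to the 114 items in that language,
    each of the form {"text": ..., "answer": ...}."""
    accuracy = defaultdict(lambda: {"correct": 0, "total": 0})
    for model in MODELS:
        for q_lang in QUESTION_LANGS:
            for p_lang in PROMPT_LANGS:
                # English prompt, or a prompt matched to the question language
                prompt = PROMPTS["en"] if p_lang == "en" else PROMPTS[q_lang]
                for item in questions[q_lang]:
                    choice = query_model(model, prompt, item["text"])
                    cell = accuracy[(model, q_lang, p_lang)]
                    cell["total"] += 1
                    cell["correct"] += int(choice == item["answer"])
    return accuracy
```

Accuracy per (model, question language, prompt language) cell can then be compared across conditions.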

Results:

The performance of large language models in answering medical multiple-choice questions varied by model and language, with substantial differences in accuracy (between 64% and 87%). The effect of input question language was significant (P<.01), with models performing best on German questions. Across the analysed large language models, prompting in English generally led to better performance than language-matched prompts, although the top-performing models were an exception, achieving comparable results with language-matched prompts. Qualitative analysis revealed different reasoning errors in the answer explanations of the analysed models (GPT-4o and Claude 3.7 Sonnet); in several explanations, these errors occurred despite factual accuracy on the topic represented. Furthermore, this analysis revealed three questions to be insufficiently precise.
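
For illustration, a significance test of the question-language effect like the one reported above could be run as a chi-square test on per-language correct/incorrect counts. The counts below are invented placeholders, not the study's data; only the procedure is meant to be illustrative.

```python
# Illustrative significance test for the effect of question language.
# The counts are invented for demonstration; they are NOT the study's data.
from scipy.stats import chi2_contingency

# rows: German, French, Italian; columns: correct, incorrect (hypothetical)
table = [
    [99, 15],  # German
    [88, 26],  # French
    [85, 29],  # Italian
]
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p:.4f}, dof={dof}")
```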

Conclusions:

Our results underline the potential of large language models in answering medical exam questions and highlight the importance of careful consideration of model choice and of prompt and input languages, given the relevant performance variability across these factors. The analysis of answer explanations demonstrates a valuable use case of large language models for improving exam question quality in medical education, provided that data security regulations permit their use. Human oversight of language-sensitive or clinically nuanced content remains essential to determine whether incorrect outputs stem from flaws in the questions themselves or from errors generated by the large language models. Ongoing evaluation and transparent reporting are needed to ensure the reliable integration of large language models into medical education contexts. Clinical Trial: Not applicable.


Citation

Please cite as:

Strasser LM, Anschuetz W, Dennstädt F, Hastings J

Performance Evaluation of Large Language Models in Multilingual Medical Multiple-Choice Questions: Mixed Methods Study

JMIR Med Educ 2026;12:e81399

DOI: 10.2196/81399

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). The authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.