Accepted for/Published in: JMIR Medical Education
Date Submitted: Jul 28, 2025
Open Peer Review Period: Jul 28, 2025 - Sep 22, 2025
Date Accepted: Jan 9, 2026
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Performance Evaluation of Large Language Models in Multilingual Medical Multiple-Choice Questions: A Mixed Method Approach
ABSTRACT
Background:
Artificial intelligence continues to transform healthcare, offering promising applications in clinical practice and medical education. While large language models, as a form of generative artificial intelligence, have shown potential to match or surpass medical students in licensing examinations, their performance varies across languages. Recent studies highlight the complex influence and interdependency of factors such as language and model type on the accuracy of large language models, yet cross-language comparisons remain underexplored. Switzerland's multilingual medical licensing examination provides a unique opportunity to investigate these dynamics.
Objective:
This study evaluates the performance of large language models in Swiss medical multiple-choice questions across three languages, aiming to uncover model capabilities in a multilingual medical education context.
Methods:
For this study, 150 publicly accessible multilingual multiple-choice questions from an online self-assessment tool were selected and analysed. A mixed-methods approach combining quantitative and qualitative methods was implemented to evaluate the large language models' outputs. Several large language models developed by OpenAI, MetaAI, Anthropic, MistralAI, and DeepSeek were evaluated by prompting them to answer these questions in a text-only format.
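To illustrate what such a text-only prompting setup could look like, the sketch below sends one multiple-choice question to a chat-completion endpoint. The prompt wording, model name, and answer handling are assumptions for illustration only and do not reproduce the study's exact protocol.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_mcq(question: str, options: dict[str, str], model: str = "gpt-4o") -> str:
    """Send one multiple-choice question as plain text and return the raw reply."""
    option_lines = "\n".join(f"{key}) {text}" for key, text in options.items())
    prompt = (
        "Answer the following medical multiple-choice question. "
        "Reply with the letter of the single best answer.\n\n"
        f"{question}\n{option_lines}"
    )
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

In a study like this one, such a function would be called once per question, per model, and per language, with the returned letter compared against the official answer key for the quantitative analysis and the full free-text reply retained for the qualitative analysis.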
Results:
The performance of large language models on medical questions varied by model and language. While most models answered the majority of multiple-choice questions correctly, accuracy differed across models. In the qualitative analysis, all models showed reasoning errors and sometimes struggled to identify the single best answer, despite demonstrating factual accuracy on the underlying topic.
Conclusions:
While our results are in line with previous demonstrations of the high potential of large language models in answering multilingual medical examination questions, this study highlights the importance of careful model selection, prompt design, and awareness of performance variability across languages. Ongoing evaluation and transparent reporting are needed to ensure the reliable integration of large language models into medical education contexts.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.