Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Dec 2, 2024
Date Accepted: Apr 24, 2025
Performance of Large Language Models in Non-English Contexts: An Evaluation of Models Trained on Different Languages in Chinese Medical Examinations
ABSTRACT
Background:
Research on large language models (LLMs) in medicine has focused predominantly on models trained on English-language corpora and evaluated in English-speaking contexts. Models trained on non-English corpora, and model performance in non-English contexts, remain underexplored.
Objective:
We used the Chinese National Medical Licensing Examination (CNMLE) as a benchmark and constructed analogous questions to evaluate the performance of LLMs trained on corpora in different languages.
Methods:
Under different prompt settings, we sequentially posed questions to seven LLMs: two primarily trained on English-language corpora and five primarily on Chinese-language corpora. The models' responses were compared against standard answers to calculate the accuracy rate of each model. Further subgroup analyses were conducted by categorizing the questions based on various criteria. We also collected error sets to explore patterns of mistakes across different models.
Results:
Under the zero-shot setting, six of the seven models exceeded the passing level, with the highest accuracy achieved by the Chinese LLM Baichuan (86.67%), followed by ChatGPT (83.83%). On the constructed questions, all seven models exceeded the passing threshold, with Baichuan again achieving the highest accuracy (87.00%). Under few-shot learning, all models exceeded the passing threshold; Baichuan, ChatGLM, and ChatGPT retained the highest accuracy. While Llama improved markedly over the prior tests, the relative rankings of the other models remained similar to previous results. In subgroup analyses, the English models performed comparably to, or better than, the Chinese models on questions related to ethics and policy. All models except Llama generally achieved higher accuracy on simple questions than on complex ones. The error set of ChatGPT overlapped substantially with those of the Chinese models. Multi-model cross-verification outperformed any single model, particularly improving accuracy on simple questions: dual-model and tri-model verification achieved accuracy rates of 94.17% and 96.33%, respectively.
Conclusions:
At their current level, LLMs trained primarily on English corpora and those trained primarily on Chinese corpora both perform well on the CNMLE, with the Chinese models holding a slight edge. The performance difference between ChatGPT and the Chinese LLMs is not solely attributable to language barriers but is more likely driven by disparities in the training data. Cross-verification with multiple LLMs can achieve excellent performance on medical examinations.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.