Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Dec 10, 2024
Date Accepted: Jun 4, 2025
Performance of Open-Source Large Language Models in Psychiatry: A Comparative Analysis of Non-English Records and English Translations
ABSTRACT
Background:
Inequalities in access to psychiatric care remain a persistent issue. While large language models offer potential solutions, closed-source models such as ChatGPT have limitations, including privacy concerns. Open-source models offer advantages such as enhanced data security and the ability to operate effectively in resource-limited settings. However, the effectiveness of open-source models in non-English psychiatric contexts remains underexplored.
Objective:
We aimed to evaluate the feasibility of an open-source large language model in Korean and English for psychiatric applications and to explore its potential to improve access to mental healthcare for non-English-speaking populations in resource-limited settings.
Methods:
The openbuddy-mistral-7b-v13.1 model, fine-tuned from Mistral 7B to enable conversational capabilities in Korean, was selected. A total of 200 psychiatric interview notes, consisting of 50 cases each of schizophrenia, bipolar disorder, depressive disorder, and anxiety disorder, were analyzed. The model generated English translations of the Korean interview notes. From both the original Korean notes and their English translations, the model was instructed to extract clinically meaningful clues and identify possible diagnoses. Additionally, the model's performance on the psychiatry section of the Korean Medical Licensing Examination was evaluated using a similar approach.
Results:
The model generated 997 clues from the Korean interview notes and 1,003 clues from the English-translated notes. Hallucinations were more frequent with Korean input (30.2%) than with English input (13.4%). Clinical reasoning was superior for English input, with 42.8% of clues showing diagnostic relevance, compared with 34.2% for Korean input. The top-1 diagnostic accuracy was also higher for English input (74.5%) than for Korean input (59%). In the psychiatry section of the medical licensing examination, the model performed better in English, achieving an accuracy of 46.1% compared with 32.2% in Korean.
Conclusions:
The findings of this study suggest that the performance of open-source LLMs in psychiatry may vary by language, which is especially relevant in resource-limited settings. Addressing this issue may require collaborative efforts, such as the development of psychiatric datasets in the respective languages. Continuous efforts are needed to create multilingual open-source LLMs capable of supporting psychiatric applications, thereby improving access to mental healthcare.