Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Apr 30, 2025
Open Peer Review Period: May 5, 2025 - Jun 30, 2025
Date Accepted: Jun 11, 2025
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Toward Cross-Hospital Deployment of NLP Systems: A Robustness Evaluation of Fine-Tuned LLMs for Japanese Disease Name Recognition
ABSTRACT
Background:
Disease name recognition is a fundamental task in clinical natural language processing (NLP), enabling the extraction of critical patient information from electronic health records (EHRs). While recent advances in large language models (LLMs) have shown promise, most evaluations have focused on English, and little is known about their robustness in low-resource languages such as Japanese. In particular, whether these models can perform reliably on previously unseen in-hospital data, which differs in writing styles and clinical cases from training data, has not been thoroughly investigated.
Objective:
This study evaluates the robustness of fine-tuned LLMs for disease name recognition in Japanese clinical notes, with a particular focus on their performance on in-hospital data that was unseen during training.
Methods:
We used two corpora for this study: (1) a publicly available set of Japanese case reports, denoted CR, and (2) a newly constructed corpus of progress notes, denoted PN, written by ten physicians to capture the stylistic variation of in-hospital clinical notes. To reflect real-world deployment scenarios, we first fine-tuned models on CR; specifically, we compared an LLM against a baseline masked language model (MLM). The models were then evaluated under two conditions: (1) on CR, representing the in-domain (ID) setting with the same document type as in training, and (2) on PN, representing the out-of-domain (OOD) setting with a different document type. Robustness was assessed by calculating the performance gap, that is, the drop in performance from the ID to the OOD setting.
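The evaluation protocol above can be sketched in a few lines: compute entity-level F1 in each setting, then take the ID-to-OOD difference as the robustness measure. This is a minimal illustration of the metric, not the authors' actual code; the helper names and the example annotations are hypothetical, though the −8.6 gap matches the figure reported for the LLM below.

```python
def f1_score(gold: set, predicted: set) -> float:
    """Entity-level F1 from sets of (span, label) annotations."""
    tp = len(gold & predicted)  # exact-match true positives
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def performance_gap(f1_id: float, f1_ood: float) -> float:
    """Negative values mean the model degrades on out-of-domain notes."""
    return f1_ood - f1_id

# Illustrative example: a model scoring 90.0 F1 on CR (ID) and
# 81.4 F1 on PN (OOD) has an ID-OOD gap of -8.6.
print(round(performance_gap(90.0, 81.4), 1))  # -8.6
```

In this framing, a model closer to zero gap is more robust, regardless of its absolute F1 in either setting.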
Results:
The LLM demonstrated greater robustness, with a smaller performance gap in F1 scores (ID−OOD = −8.6) than the MLM baseline (ID−OOD = −13.9). This indicates more stable performance across the ID and OOD settings, highlighting the suitability of fine-tuned LLMs for reliable use in diverse clinical settings.
Conclusions:
Fine-tuned LLMs demonstrate superior robustness for disease name recognition in Japanese clinical notes with a smaller performance gap. These findings highlight the potential of LLMs as reliable tools for clinical NLP in low-resource language settings and support their deployment in real-world healthcare applications where documentation diversity is inevitable. Clinical Trial: None
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.