
Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Apr 30, 2025
Open Peer Review Period: May 5, 2025 - Jun 30, 2025
Date Accepted: Jun 11, 2025

The final, peer-reviewed published version of this preprint can be found here:

Toward Cross-Hospital Deployment of Natural Language Processing Systems: Model Development and Validation of Fine-Tuned Large Language Models for Disease Name Recognition in Japanese

Shimizu S, Nishiyama T, Nagai H, Wakamiya S, Aramaki E


JMIR Med Inform 2025;13:e76773

DOI: 10.2196/76773

PMID: 40627819

PMCID: 12262928

Toward Cross-Hospital Deployment of NLP Systems: A Robustness Evaluation of Fine-Tuned LLMs for Japanese Disease Name Recognition

  • Seiji Shimizu; 
  • Tomohiro Nishiyama; 
  • Hiroyuki Nagai; 
  • Shoko Wakamiya; 
  • Eiji Aramaki

ABSTRACT

Background:

Disease name recognition is a fundamental task in clinical natural language processing (NLP), enabling the extraction of critical patient information from electronic health records (EHRs). Although recent advances in large language models (LLMs) have shown promise, most evaluations have focused on English, and little is known about their robustness in low-resource languages such as Japanese. In particular, whether these models perform reliably on previously unseen in-hospital data, which differs from the training data in writing style and clinical case mix, has not been thoroughly investigated.

Objective:

This study evaluates the robustness of fine-tuned LLMs for disease name recognition in Japanese clinical notes, with a particular focus on their performance on in-hospital data unseen during training.

Methods:

We used two corpora for this study: (1) CR, a publicly available set of Japanese case reports, and (2) PN, a newly constructed corpus of progress notes written by ten physicians to capture the stylistic variation of in-hospital clinical notes. To reflect real-world deployment scenarios, we first fine-tuned models on CR; specifically, we compared an LLM with a baseline masked language model (MLM). The models were then evaluated under two conditions: (1) on CR, representing the in-domain (ID) setting with the same document type as in training, and (2) on PN, representing the out-of-domain (OOD) setting with a different document type. Robustness was assessed by the performance gap, that is, the drop in performance from the ID setting to the OOD setting.
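The evaluation protocol above can be sketched in a few lines: score the model on each test set with entity-level F1, then report the ID-to-OOD difference. The following is a minimal, illustrative sketch with toy spans and a simple micro-averaged span F1; the paper's actual corpora, models, and scoring pipeline are not shown here, and a negative gap is read as a performance drop from ID to OOD.

```python
def entity_f1(gold_spans, pred_spans):
    """Micro-averaged entity-level F1 over exact-match spans.

    Spans are (doc_id, start, end, label) tuples; a prediction counts as a
    true positive only if all four fields match a gold span exactly.
    """
    gold, pred = set(gold_spans), set(pred_spans)
    if not gold or not pred:
        return 0.0
    tp = len(gold & pred)
    precision = tp / len(pred)
    recall = tp / len(gold)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Toy annotations, purely illustrative (doc_id, start_char, end_char, label).
gold_id  = {(0, 5, 9, "DISEASE"), (0, 20, 28, "DISEASE"), (1, 3, 10, "DISEASE")}
pred_id  = {(0, 5, 9, "DISEASE"), (0, 20, 28, "DISEASE")}
gold_ood = {(0, 2, 7, "DISEASE"), (1, 11, 18, "DISEASE")}
pred_ood = {(0, 2, 7, "DISEASE"), (1, 0, 4, "DISEASE")}

f1_id = entity_f1(gold_id, pred_id)     # in-domain (CR-style) score
f1_ood = entity_f1(gold_ood, pred_ood)  # out-of-domain (PN-style) score
gap = (f1_ood - f1_id) * 100            # gap in F1 points; negative = drop
print(f"ID F1 = {f1_id:.2f}, OOD F1 = {f1_ood:.2f}, gap = {gap:.1f} points")
```

On these toy spans the in-domain F1 is 0.80, the out-of-domain F1 is 0.50, and the gap is −30.0 points; a more robust model is one whose gap is closer to zero.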

Results:

The LLM demonstrated greater robustness, with a smaller F1-score performance gap (ID–OOD = −8.6) than the MLM baseline (ID–OOD = −13.9). This indicates more stable performance across the ID and OOD settings, highlighting the suitability of the fine-tuned LLM for reliable use in diverse clinical settings.

Conclusions:

Fine-tuned LLMs demonstrate superior robustness for disease name recognition in Japanese clinical notes, exhibiting a smaller performance gap between in-domain and out-of-domain data. These findings highlight the potential of LLMs as reliable tools for clinical NLP in low-resource language settings and support their deployment in real-world healthcare applications, where documentation diversity is inevitable. Clinical Trial: None






© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage the authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.