Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Nov 25, 2024
Date Accepted: Jan 31, 2025

The final, peer-reviewed published version of this preprint can be found here:

Performance Improvement of a Natural Language Processing Tool for Extracting Patient Narratives Related to Medical States From Japanese Pharmaceutical Care Records by Increasing the Amount of Training Data: Natural Language Processing Analysis and Validation Study

Ohno Y, Aomori T, Nishiyama T, Kato R, Fujiki R, Ishikawa H, Kiyomiya K, Isawa M, Mochizuki M, Aramaki E, Ohtani H

JMIR Med Inform 2025;13:e68863

DOI: 10.2196/68863

PMID: 40053805

PMCID: 11920660

Performance Improvement of the Natural Language Processing Tool to Extract Patient Narratives Related to Medical States From Japanese Pharmaceutical Care Records by Increasing the Amount of Training Data: Natural Language Processing Analysis

  • Yukiko Ohno
  • Tohru Aomori
  • Tomohiro Nishiyama
  • Riri Kato
  • Reina Fujiki
  • Haruki Ishikawa
  • Keisuke Kiyomiya
  • Minae Isawa
  • Mayumi Mochizuki
  • Eiji Aramaki
  • Hisakazu Ohtani

ABSTRACT

Background:

Patients’ oral expressions are a valuable source of clinical information for improving pharmacotherapy. Natural language processing (NLP) is a useful approach for analyzing unstructured text data, such as patient narratives. However, few studies have focused on using NLP for narratives in the Japanese language.

Objective:

To develop a high-performance NLP system for extracting clinical information from patient narratives, we examined the performance progression as the amount of training data was gradually increased.

Methods:

Subjective texts from the pharmaceutical care records of Keio University Hospital from April 1, 2018 to March 31, 2019, comprising 12,004 records from 6,559 cases, were used. After preprocessing, we annotated diseases and symptoms within the texts. We then trained and evaluated deep learning models—bidirectional encoder representations from transformers combined with a conditional random field (BERT-CRF)—by 10-fold cross-validation. The annotated data were divided into 10 subsets, and the amount of training data was progressively increased over 10 steps. We also analyzed the causes of errors. Finally, we applied the developed system to the analysis of case report texts to evaluate its usability for texts from other sources.
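The stepwise increase in training data described above can be illustrated with a minimal sketch. It assumes the records are shuffled once and cut into 10 near-equal subsets that are then accumulated step by step; the function name, the seed, and the use of integer record IDs are hypothetical, and how the held-out evaluation data were chosen at each step is not stated in the abstract, so it is left out here.

```python
import random


def cumulative_training_steps(record_ids, n_subsets=10, seed=0):
    """Return n_subsets training sets of growing size.

    Assumption (not from the paper): records are shuffled once, split into
    near-equal subsets, and step k uses the first k subsets.
    """
    ids = list(record_ids)
    random.Random(seed).shuffle(ids)
    # Strided slicing gives subsets whose sizes differ by at most one record.
    subsets = [ids[i::n_subsets] for i in range(n_subsets)]
    steps = []
    current = []
    for subset in subsets:
        current = current + subset  # new list, so earlier steps stay unchanged
        steps.append(current)
    return steps


# 12,004 is the record count reported in the Methods; the IDs are placeholders.
steps = cumulative_training_steps(range(12004))
sizes = [len(step) for step in steps]
print(sizes[0], sizes[-1])
```

Each step would then be used to train a fresh model, so that the resulting scores trace a learning curve over the amount of training data.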

Results:

The F1-score of the system improved from 0.67 to 0.82 as the amount of training data increased from 1,200 to 12,004 records. The F1-score reached 0.78 with 3,600 records and largely saturated thereafter. As performance improved, errors from incorrect extractions decreased significantly, increasing precision. For case reports, the F1-score also increased from 0.34 to 0.41 as the training dataset expanded from 1,200 to 12,004 records. Performance was lower for extracting symptoms from case report texts compared with pharmaceutical care records, suggesting that this system is more specialized for analyzing subjective data from pharmaceutical care records.
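As a reminder of how the reported metric behaves, the following sketch computes precision, recall, and F1-score for extracted entities. It assumes exact-match scoring of (record, start, end, label) spans, which the abstract does not specify, and the spans themselves are invented for illustration only.

```python
def precision_recall_f1(gold, predicted):
    """Entity-level scores, counting a prediction as correct only on exact match.

    Exact span-and-label matching is an assumption made for this sketch.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical spans: (record_id, start, end, label)
gold = {(1, 0, 2, "symptom"), (1, 5, 8, "disease"), (2, 3, 6, "symptom")}
# A model that finds two gold spans but also extracts two spurious ones.
noisy = {(1, 0, 2, "symptom"), (2, 3, 6, "symptom"),
         (2, 9, 11, "symptom"), (3, 0, 4, "disease")}
# A model that finds the same two gold spans without the spurious ones.
cleaner = {(1, 0, 2, "symptom"), (2, 3, 6, "symptom")}

p_noisy, r_noisy, f1_noisy = precision_recall_f1(gold, noisy)
p_clean, r_clean, f1_clean = precision_recall_f1(gold, cleaner)
print(p_noisy, r_noisy, f1_noisy)
print(p_clean, r_clean, f1_clean)
```

In this toy case recall is identical for both models, and removing the incorrect extractions alone raises precision and therefore the F1-score, which mirrors the pattern described in the Results.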

Conclusions:

We successfully developed a high-performance system specialized in analyzing subjective data from pharmaceutical care records by training on a large dataset; system performance was nearly saturated at about 3,600 training records. This system will be useful for monitoring symptoms, offering benefits for both clinical practice and research.


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.