Accepted for/Published in: JMIR Formative Research

Date Submitted: Nov 8, 2023
Date Accepted: May 29, 2024

The final, peer-reviewed published version of this preprint can be found here:

Predictive Model for Extended-Spectrum β-Lactamase–Producing Bacterial Infections Using Natural Language Processing Technique and Open Data in Intensive Care Unit Environment: Retrospective Observational Study

Ito G, Yada S, Wakamiya S, Aramaki E

JMIR Form Res 2024;8:e54044

DOI: 10.2196/54044

PMID: 38986131

PMCID: 11269962

Predictive Model for Extended-Spectrum Beta-Lactamase-Producing Bacterial Infections Using Natural Language Processing Technique and Open Data in Intensive Care Unit Environment: Retrospective Observational Study

  • Genta Ito
  • Shuntaro Yada
  • Shoko Wakamiya
  • Eiji Aramaki

ABSTRACT

Background:

Machine learning has recently helped create models for predicting medical events, mostly using private datasets. The MIMIC-3 dataset, which is public and contains detailed data on over 40,000 intensive care unit patients, stands out as it can help develop better models using not only structured data but also text from medical records.

Objective:

This study aimed to build and test a machine learning model using the MIMIC-3 dataset to determine the effectiveness of information extracted from electronic medical record text using named entity recognition (NER), specifically QuickUMLS, for predicting important medical events. Using the prediction of extended-spectrum beta-lactamase (ESBL)-producing bacterial infections as an example, this study shows how open data sources and simple technology can be useful for making clinically meaningful predictions.
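The idea of turning record text into model inputs can be illustrated with a toy dictionary matcher. This is a minimal sketch and not QuickUMLS itself: the lexicon, the concept labels, and the function names `extract_concepts` and `to_feature_row` are all hypothetical, chosen only to show how matched medical-history terms might become 0/1 predictor variables.

```python
import re

# Hypothetical mini-lexicon (assumption): surface form -> concept label.
# A UMLS-backed tool would return concept identifiers instead of these labels.
TOY_LEXICON = {
    "diabetes mellitus": "diabetes",
    "diabetes": "diabetes",
    "urinary tract infection": "urinary_tract_infection",
    "chronic kidney disease": "chronic_kidney_disease",
}


def extract_concepts(text, lexicon=TOY_LEXICON):
    """Return the set of concept labels whose surface forms occur in text."""
    normalized = re.sub(r"\s+", " ", text.lower())
    found = set()
    for surface, concept in lexicon.items():
        # Word boundaries avoid matching inside longer words.
        if re.search(r"\b" + re.escape(surface) + r"\b", normalized):
            found.add(concept)
    return found


def to_feature_row(text, concepts):
    """Encode one note as 0/1 indicators in a fixed concept order."""
    present = extract_concepts(text)
    return [1 if c in present else 0 for c in concepts]


CONCEPTS = sorted(set(TOY_LEXICON.values()))
note = "Past medical history: diabetes mellitus, recurrent urinary tract infection."
row = to_feature_row(note, CONCEPTS)
print(dict(zip(CONCEPTS, row)))
```

A matcher this simple has no notion of negation or context, so a phrase such as "denies diabetes" would still be counted as a positive history. Errors of that kind are one way term-matching can misidentify concepts.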

Methods:

The MIMIC-3 dataset, which offers a wealth of information, including demographics, vital signs, laboratory results, and textual data such as discharge summaries, was used. This study specifically targeted patients diagnosed with Klebsiella pneumoniae or Escherichia coli infection. Predictions were based on ESBL-producing bacterial standards and the minimum inhibitory concentration criteria. Both the structured data and extracted patient histories were used as predictors. Two models, an L1-regularized logistic regression model and a LightGBM model, were evaluated using the area under the receiver operating characteristic curve (ROC-AUC) and the area under the precision-recall curve (PR-AUC).
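The two evaluation metrics can be made concrete with a small, dependency-free sketch. It is not the authors' evaluation code: the function names and the four-patient toy data are assumptions, and PR-AUC is approximated here by average precision (a step-wise summary of the precision-recall curve), which may differ slightly from a trapezoidal estimate.

```python
def roc_auc(y_true, scores):
    """ROC-AUC as the probability that a random positive outranks a random
    negative, counting ties as half a win."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Mean of the precision values at the rank of each positive case.
    Assumes no tied scores (ties would need threshold grouping)."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank  # precision at this recall step
    return total / n_pos


# Toy data (assumption): 1 = ESBL-producing infection, 0 = not.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(y_true, scores))
print(average_precision(y_true, scores))
```

ROC-AUC has a chance level of 0.5 regardless of class balance, whereas the chance level of PR-AUC is roughly the proportion of positive cases, so the two metrics are best read together when the outcome is uncommon.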

Results:

Of 46,520 MIMIC-3 patients, 4,046 had bacterial culture tests indicating the presence of K. pneumoniae or E. coli. After excluding patients who lacked discharge summary text, 3,614 patients remained. The L1-penalized model, with variables from only the structured data, achieved an ROC-AUC of 0.646 and a PR-AUC of 0.307. The LightGBM model, with variables from both the structured and text data, outperformed it, with an ROC-AUC of 0.707 and a PR-AUC of 0.369. Key contributors to the LightGBM model included patient age, duration since hospital admission, and specific medical history such as diabetes.

Conclusions:

The structured data-based model showed improved performance compared to the reference models. Performance was further improved when textual medical history was included. Compared to other models predicting drug-resistant bacteria, the results of this study ranked in the middle. Some misidentifications, potentially due to the limitations of QuickUMLS, may have affected the accuracy of the model. This study successfully developed a predictive model for ESBL-producing bacterial infections using the MIMIC-3 dataset, yielding results consistent with existing literature. This model stands out for its transparency and reliance on open data and open NER technology. The performance of the model was enhanced using textual information. With advancements in natural language processing tools, such as BERT and GPT, the extraction of medical data from textual sources holds substantial potential for future model optimization.

© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.