Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Sep 17, 2025
Date Accepted: Mar 31, 2026

The final, peer-reviewed published version of this preprint can be found here:

Scalable Identification of Clinically Relevant Chronic Obstructive Pulmonary Disease Documents in Large-Scale Electronic Health Record Datasets With a Lightweight Natural Language Processing Model: Retrospective Cohort Study

Al-Garadi M, Davis SE, Matheny ME, Westerman D, Conger AK, Richmond BW, Lasko TA, Ricket IM, Paulin LM, Brown JR, Reeves RM

JMIR Med Inform 2026;14:e84326

DOI: 10.2196/84326

PMID: 42119137

Scalable Identification of Clinically Relevant COPD Documents: A Lightweight NLP Model for Large-Scale EHR Datasets

  • Mohammed Al-Garadi
  • Sharon E. Davis
  • Michael E. Matheny
  • Dax Westerman
  • Adrienne K. Conger
  • Bradley W. Richmond
  • Thomas A. Lasko
  • Iben M. Ricket
  • Laura M. Paulin
  • Jeremiah R. Brown
  • Ruth M. Reeves

ABSTRACT

Background:

The widespread adoption of electronic health records has generated large volumes of clinical notes. Learning algorithms and large language models are trained on these resources but are susceptible to the noise they contain, that is, irrelevant or non-informative data. This sensitivity can degrade performance and produce inaccurate predictions or "hallucinations." This study addresses a critical challenge in clinical informatics: efficiently filtering millions of documents for relevance before advanced language model processing, particularly in resource-constrained environments.

Objective:

We present a novel framework for determining document relevance in clinical settings, demonstrated on a chronic obstructive pulmonary disease (COPD) dataset.

Methods:

We developed a novel framework that uses weak supervision and domain-expert heuristics to generate "silver standard" labels for training data, together with expert-annotated "gold standard" labels, creating two datasets for model optimization during the development phase and for the subsequent testing phase. Various text representation techniques, including Bag-of-Words, TF-IDF, lightweight document embeddings, compression-based features, and UMLS concept extraction, were evaluated. These representations were used to train Random Forest, XGBoost, and K-Nearest Neighbors classifiers. Models were optimized on a small expert-annotated dataset and evaluated on a held-out test set.
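
The weak-supervision step can be illustrated with a minimal sketch. It assumes that "silver standard" labels come from simple keyword heuristics; the patterns, the `min_hits` threshold, and the example notes below are hypothetical and are not the study's actual expert rules or data.

```python
import re

# Hypothetical heuristic patterns (assumption): stand-ins for domain-expert rules.
COPD_PATTERNS = [
    re.compile(r"\bcopd\b", re.IGNORECASE),
    re.compile(r"chronic obstructive pulmonary disease", re.IGNORECASE),
    re.compile(r"\bemphysema\b", re.IGNORECASE),
    re.compile(r"\bfev1\b", re.IGNORECASE),
]


def silver_label(text, min_hits=2):
    """Return 1 (relevant) if at least `min_hits` distinct patterns match, else 0."""
    hits = sum(1 for pattern in COPD_PATTERNS if pattern.search(text))
    return 1 if hits >= min_hits else 0


# Invented example notes, for illustration only.
notes = [
    "Patient with COPD exacerbation; FEV1 45% predicted, started prednisone.",
    "Follow-up for ankle sprain. No respiratory complaints.",
    "History of emphysema noted on prior CT.",
]
silver_labels = [silver_label(note) for note in notes]
print(silver_labels)
```

Labels produced this way are noisy by design: the third note mentions emphysema but falls below the threshold, which is the kind of miss that a classifier trained on richer text representations is meant to recover.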

Results:

The combination of lightweight document embedding with a Random Forest classifier demonstrated the best performance, achieving a precision of 0.75, recall of 0.89, and F1-score of 0.81 (95% CI: 0.76-0.87) for identifying relevant COPD documents. This significantly outperformed baseline heuristics (precision: 0.70, recall: 0.38, F1-score: 0.50, 95% CI: 0.43-0.56) and other tested methods.
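
As a quick consistency check on these figures, the F1-score is the harmonic mean of precision and recall. The sketch below recomputes it from the rounded values reported above; the function name `f1_score` is illustrative, and the confidence intervals are not reproduced because the abstract does not state how they were estimated.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded precision and recall as reported in the Results.
model_f1 = f1_score(0.75, 0.89)
baseline_f1 = f1_score(0.70, 0.38)

print(f"model F1 = {model_f1:.3f}")
print(f"baseline F1 = {baseline_f1:.3f}")
```

The model value comes out at about 0.814, in line with the reported 0.81. The baseline value comes out at about 0.493, slightly below the reported 0.50; a gap of that size is plausible if the precision and recall were rounded before reporting.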

Conclusions:

Our study presents a novel framework for identifying COPD-relevant clinical documents using lightweight embedding and machine learning. This approach effectively filters pertinent documents, enhancing information retrieval precision. The framework's scalability and minimal annotation needs make it promising for diverse healthcare applications, potentially optimizing clinical outcomes through efficient document selection for data-driven decision support systems.


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.