Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Aug 13, 2021
Date Accepted: Dec 16, 2021
Strategies to address the lack of labelled data for supervised machine learning training with electronic health records: a case study for extraction of symptoms from clinical notes
ABSTRACT
Background:
Automated extraction of symptoms from clinical notes is a challenging task because symptoms are described in many different, multidimensional ways. Labeled training data are extremely scarce because the data themselves contain protected health information. Natural language processing and machine learning have great potential for processing clinical text for this task. However, supervised machine learning requires large amounts of labeled data to train a model, which is the main bottleneck in model development.
Objective:
The aim of this study is to address the lack of labeled data by proposing two alternatives to manual labeling for generating training labels for supervised machine learning with English clinical text. We aim to demonstrate that training on silver-quality labels leads to good classification results.
Methods:
We address the lack of labels with two strategies. The first approach takes advantage of the structured part of electronic health records (EHRs) and uses diagnosis codes (ICD-10) to derive training labels. The second approach uses weak supervision and data programming principles to derive training labels. We apply the developed framework to the extraction of symptom information from outpatient progress notes of cardiovascular patients.
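The two labeling strategies described above can be sketched in a few lines of code. This is a minimal illustration, not the authors' actual pipeline: the ICD-10 codes, symptom, cue phrases, and function names below are hypothetical examples chosen for clarity.

```python
# Strategy 1: derive a training label from the structured ICD-10 codes
# recorded for the same encounter as the note.
DYSPNEA_ICD10 = {"R06.0", "R06.00", "R06.02"}  # illustrative dyspnea-related codes

def label_from_icd10(encounter_codes):
    """Return 1 if any dyspnea-related ICD-10 code was recorded, else 0."""
    return int(any(code in DYSPNEA_ICD10 for code in encounter_codes))

# Strategy 2: weak supervision — several noisy labeling functions vote on
# the note text, and their votes are combined into a single weak label.
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

def lf_keyword(text):
    """Fire POSITIVE when a symptom keyword appears; otherwise abstain."""
    t = text.lower()
    return POSITIVE if "shortness of breath" in t or "dyspnea" in t else ABSTAIN

def lf_negation(text):
    """Fire NEGATIVE when a negation cue precedes the symptom; otherwise abstain."""
    return NEGATIVE if "denies shortness of breath" in text.lower() else ABSTAIN

def weak_label(text, lfs=(lf_negation, lf_keyword)):
    """Combine non-abstaining votes; a negation cue overrides a keyword match."""
    votes = [lf(text) for lf in lfs if lf(text) != ABSTAIN]
    if not votes:
        return ABSTAIN
    return NEGATIVE if NEGATIVE in votes else POSITIVE
```

Notes flagged POSITIVE (or encounters with a matching code) become silver-quality training examples, so no manual annotation is needed to assemble the training set.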
Results:
We used over 500,000 notes to train our classification model with ICD-10 codes as labels, and more than 800,000 notes to train with labels derived from weak supervision. We show that the dependence between prevalence and recall becomes flat provided a sufficiently large training set is used (over 500,000 documents). We further demonstrate that training on weak labels, rather than on the EHR codes derived from the patient encounter, leads to an overall improvement in recall (10% on average). Finally, external validation of our models shows excellent predictive performance and transferability, with an overall increase of 20% in recall.
Conclusions:
This work demonstrates the power of a weak labeling pipeline for annotating and extracting symptom mentions in clinical text, with the prospect of facilitating the integration of symptom information into downstream clinical tasks such as clinical decision support.