Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Aug 13, 2021
Date Accepted: Dec 16, 2021
Strategies to address the lack of labelled data for supervised machine learning training with electronic health records: a case study for extraction of symptoms from clinical notes
ABSTRACT
Background:
Automated extraction of symptoms from clinical notes is a challenging task because symptoms are described in many different, multidimensional ways. Labeled training data are extremely scarce because the data themselves contain protected health information. Natural language processing and machine learning have great potential for processing clinical text for this task. However, supervised machine learning requires large amounts of labeled data to train a model, which is the main bottleneck in model development.
Objective:
The aim of this study is to address the lack of labeled data by proposing two alternatives to manual labeling for generating training labels for supervised machine learning with English clinical text. We aim to demonstrate that training on silver-quality labels leads to good classification results.
Methods:
We address the lack of labels with two strategies. The first approach takes advantage of the structured part of electronic health records (EHRs) and uses diagnosis codes (ICD-10) to derive training labels. The second approach uses weak supervision and data programming principles to derive training labels. We apply the developed framework to the extraction of symptom information from outpatient progress notes of cardiovascular patients.
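The two labeling strategies described above can be sketched in a few lines of code. This is a minimal illustration, not the authors' actual pipeline: the ICD-10 codes, symptom, cue phrases, and function names below are hypothetical examples chosen for clarity.

```python
# Strategy 1: derive a training label from the structured ICD-10 codes
# recorded for the same encounter as the note.
DYSPNEA_ICD10 = {"R06.0", "R06.00", "R06.02"}  # illustrative dyspnea-related codes

def label_from_icd10(encounter_codes):
    """Return 1 if any dyspnea-related ICD-10 code was recorded, else 0."""
    return int(any(code in DYSPNEA_ICD10 for code in encounter_codes))

# Strategy 2: weak supervision — several noisy labeling functions vote on
# the note text, and their votes are combined into a single weak label.
ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1

def lf_keyword(text):
    """Fire POSITIVE when a symptom keyword appears; otherwise abstain."""
    t = text.lower()
    return POSITIVE if "shortness of breath" in t or "dyspnea" in t else ABSTAIN

def lf_negation(text):
    """Fire NEGATIVE when a negation cue precedes the symptom; otherwise abstain."""
    return NEGATIVE if "denies shortness of breath" in text.lower() else ABSTAIN

def weak_label(text, lfs=(lf_negation, lf_keyword)):
    """Combine non-abstaining votes; a negation cue overrides a keyword match."""
    votes = [lf(text) for lf in lfs if lf(text) != ABSTAIN]
    if not votes:
        return ABSTAIN
    return NEGATIVE if NEGATIVE in votes else POSITIVE
```

Notes flagged POSITIVE (or encounters with a matching code) become silver-quality training examples, so no manual annotation is needed to assemble the training set.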
Results:
We used over 500,000 notes to train our classification model with ICD-10 codes as labels, and more than 800,000 notes to train with labels derived from weak supervision. We show that the dependence between prevalence and recall becomes flat provided a sufficiently large training set is used (over 500,000 documents). We further demonstrate that training on weak labels, rather than on the EHR codes derived from the patient encounter, leads to an overall improvement in recall (10% on average). Finally, external validation of our models shows excellent predictive performance and transferability, with an overall increase of 20% in recall.
Conclusions:
This work demonstrates the power of a weak labeling pipeline for annotating and extracting symptom mentions in clinical text, with the prospect of facilitating the integration of symptom information into downstream clinical tasks such as clinical decision support.