JMIR Preprints #15980: A text mining approach to cohort selection from longitudinal patient records


A text mining approach to cohort selection from longitudinal patient records

Irena Spasić;
Dominik Krzemiński;
Padraig Corcoran;
Alexander Balinsky

ABSTRACT

Background:

Clinical trials are an important step in introducing new interventions into clinical practice because they generate data on safety and efficacy. Trial participants need to be similar so that any findings can be attributed to the interventions studied rather than to other factors. Therefore, each clinical trial defines eligibility criteria, which describe characteristics that the participants must share. Unfortunately, eligibility criteria are often too complex to be translated directly into readily executable database queries. Instead, they may require careful analysis of the narrative sections of medical records. Manual screening of medical records is time-consuming and therefore delays the recruitment process.

Objective:

Track 1 of the 2018 National NLP Clinical Challenges (n2c2) focused on the task of cohort selection for clinical trials, with the aim of answering the following question: "Can natural language processing be applied to narrative medical records to identify patients who meet eligibility criteria for clinical trials?" The task required the participating systems to analyze longitudinal patient records to determine whether the corresponding patients met the given eligibility criteria. This article describes a system developed to address this task.

Methods:

Our system consists of 13 classifiers, one for each eligibility criterion. All classifiers use a bag-of-words document representation model. To prevent the loss of relevant contextual information associated with such a representation, a pattern matching approach is used to extract context-sensitive features. These features are embedded back into the text as lexically distinguishable tokens, which are consequently included in the bag-of-words representation. Supervised machine learning was chosen wherever a sufficient number of both positive and negative instances were available to learn from. A rule-based approach focusing on a small set of relevant features was chosen for the remaining criteria.
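The idea of embedding context-sensitive features back into the text can be illustrated with a minimal Python sketch. It is not the authors' implementation: the two patterns, the token names (HBA1C_HIGH, HBA1C_NORMAL, ALCOHOL_NEGATED) and the 6.5 threshold are hypothetical and serve only to show how a matched pattern can become a distinct token that survives in a bag-of-words representation.

```python
import re
from collections import Counter

# Hypothetical patterns for illustration only; the rules used by the
# system are not listed in this abstract.
PATTERNS = [
    # A numeric laboratory value is turned into a categorical token.
    # The 6.5 cut-off is an assumed, illustrative threshold.
    (re.compile(r"\bhba1c\s*(?:of|:|=)?\s*(\d+(?:\.\d+)?)", re.I),
     lambda m: "HBA1C_HIGH" if float(m.group(1)) >= 6.5 else "HBA1C_NORMAL"),
    # A negated mention is turned into its own token, so that negation
    # is not lost when word order is discarded.
    (re.compile(r"\b(?:denies|no history of)\s+(?:alcohol|etoh)\b", re.I),
     lambda m: "ALCOHOL_NEGATED"),
]


def embed_features(text):
    """Append one synthetic token to the text for every pattern match."""
    tokens = []
    for pattern, make_token in PATTERNS:
        for match in pattern.finditer(text):
            tokens.append(make_token(match))
    return text + " " + " ".join(tokens) if tokens else text


def bag_of_words(text):
    """Count lower-cased word tokens; underscores keep synthetic tokens whole."""
    return Counter(re.findall(r"\w+", text.lower()))


note = "Patient denies alcohol use. HbA1c of 7.2 last month."
bow = bag_of_words(embed_features(note))
```

In this sketch the synthetic tokens are counted alongside ordinary words, so a downstream classifier can weight "hba1c_high" separately from a plain mention of "hba1c".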

Results:

The system was evaluated using micro-averaged F-measure. Four machine learning algorithms (support vector machine, logistic regression, naïve Bayesian classifier and gradient tree boosting) were evaluated on the training data using 10-fold cross-validation. Overall, gradient tree boosting demonstrated the most consistent performance. Its performance peaked when oversampling was used to balance the training data. Final evaluation was performed on previously unseen test data. The overall F-measure of 89.04% was comparable to three of the top-ranked performances in the shared task (91.11%, 90.28% and 90.21%). With an F-measure of 88.14%, we significantly outperformed these systems (81.03%, 78.50% and 70.81%) in identifying patients with advanced coronary artery disease.
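For readers unfamiliar with micro-averaging, the following Python sketch shows the general calculation: true positives, false positives and false negatives are pooled across all criteria before precision and recall are computed. It is a generic illustration, not the official scoring script of the shared task; the function name micro_f1, the criterion names "A" and "B" and the toy labels are made up.

```python
def micro_f1(gold, predicted):
    """Micro-averaged F-measure over several binary criteria.

    gold and predicted map a criterion name to a list of booleans,
    one per patient, where True means the criterion is met.
    """
    tp = fp = fn = 0
    for criterion, gold_labels in gold.items():
        for g, p in zip(gold_labels, predicted[criterion]):
            if p and g:
                tp += 1          # correctly predicted as met
            elif p and not g:
                fp += 1          # predicted met, actually not met
            elif g and not p:
                fn += 1          # missed a met criterion
    if tp == 0:
        return 0.0               # avoids division by zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data: two hypothetical criteria, three patients each.
gold = {"A": [True, True, False], "B": [False, True, False]}
pred = {"A": [True, False, False], "B": [True, True, False]}
score = micro_f1(gold, pred)
```

Because the counts are pooled, criteria with more positive instances contribute more to the micro-averaged score than rare ones.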

Conclusions:

The holdout evaluation provides evidence that our system was able to identify eligible patients for the given clinical trial with high accuracy. Our approach demonstrates how rule-based knowledge infusion can improve the performance of machine learning algorithms even when trained on a relatively small dataset.

Citation

Please cite as:

Spasić I, Krzemiński D, Corcoran P, Balinsky A

Cohort Selection for Clinical Trials From Longitudinal Patient Records: Text Mining Approach

JMIR Med Inform 2019;7(4):e15980

DOI: 10.2196/15980

PMID: 31674914

PMCID: 6913747

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Aug 28, 2019

Date Accepted: Oct 2, 2019
