JMIR Preprints #50800: Empowering Clinical Trials with Natural Language Processing Models and Real-World Data: A Feasibility Study to Optimize Clinical Trial Eligibility Design with Data-driven Simulations

Empowering Clinical Trials with Natural Language Processing Models and Real-World Data: A Feasibility Study to Optimize Clinical Trial Eligibility Design with Data-driven Simulations

Kyeryoung Lee;
Zongzhi Lui;
Yun Mai;
Mitchell K Higashi;
Tomi Jun;
Meng Ma;
Tongyu Wang;
Lei Ai;
Ediz Calay;
William Oh;
Gustavo Stolovitzky;
Eric Schadt;
Xiaoyan Wang

ABSTRACT

Background:

Clinical trials are vital for developing new therapies but can also delay drug development. Efficient trial data management, optimized trial protocols, and accurate patient identification are critical for reducing trial timelines. Natural language processing (NLP) has the potential to help achieve these objectives.

Objective:

This study aimed to assess the feasibility of using data-driven approaches to optimize clinical trial protocol design and identify eligible patients. This involves creating a comprehensive eligibility criteria knowledge base integrated within electronic health records (EHRs) using deep learning-based NLP technologies.

Methods:

We obtained 3,281 industry-sponsored, interventional phase 2 or 3 clinical trials from ClinicalTrials.gov, covering the period from 2013 to 2020 and recruiting patients with non-small cell lung cancer, prostate cancer, breast cancer, multiple myeloma, ulcerative colitis, or Crohn's disease. A customized NLP pipeline based on bidirectional long short-term memory (BiLSTM) and conditional random field (CRF) models was used to extract all eligibility criteria attributes and convert hypernym concepts into computable hyponyms with their corresponding values. To illustrate the simulation of clinical trial design for optimization purposes, we selected, as a pilot study, a subset of patients with non-small cell lung cancer (N=2,775) curated from the Mount Sinai Healthcare System.
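The hypernym-to-hyponym conversion described above can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the authors' pipeline: the class ComputableCriterion, the table HYPERNYM_TO_HYPONYMS, the function expand_hypernym, and every attribute name, threshold, and unit in it are hypothetical placeholders rather than values taken from the knowledge base.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComputableCriterion:
    """One machine-interpretable hyponym: an attribute compared to a value."""
    attribute: str
    operator: str
    value: float
    unit: str


# Hypothetical lookup table. The thresholds are illustrative placeholders only;
# they are not taken from the paper or from any specific trial protocol.
HYPERNYM_TO_HYPONYMS = {
    "adequate renal function": [
        ComputableCriterion("creatinine_clearance", ">=", 60.0, "mL/min"),
    ],
    "adequate hematologic function": [
        ComputableCriterion("absolute_neutrophil_count", ">=", 1.5, "10^9/L"),
        ComputableCriterion("platelet_count", ">=", 100.0, "10^9/L"),
    ],
}


def expand_hypernym(phrase: str) -> list[ComputableCriterion]:
    """Return the computable hyponyms for a broad criterion phrase.

    The phrase is lowercased and whitespace-normalized before lookup.
    Unknown phrases yield an empty list instead of raising an error.
    """
    key = " ".join(phrase.lower().split())
    return list(HYPERNYM_TO_HYPONYMS.get(key, []))


if __name__ == "__main__":
    for criterion in expand_hypernym("Adequate hematologic function"):
        print(criterion)
```

In this sketch a broad phrase such as "adequate hematologic function" resolves to two attribute-operator-value triples that could be compared against structured EHR fields. A production system would presumably derive the mapping from the extracted entities rather than from a hand-written table.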

Results:

We manually annotated the clinical trial eligibility corpus (N=485 trials) and constructed an eligibility criteria-specific ontology. Our customized NLP pipeline, developed on the basis of this ontology, achieved high precision (0.91), recall (0.79), and F1-score (0.83), enabling efficient extraction of granular criteria entities and relevant attributes from 3,281 clinical trials. A standardized eligibility criteria knowledge base, compatible with EHRs, was developed by transforming hypernym concepts into machine-interpretable hyponyms along with their corresponding values. Additionally, an interface prototype demonstrated the practicality of leveraging real-world data for optimizing clinical trial protocols and identifying eligible patients.

Conclusions:

Our customized NLP pipeline successfully generated a standardized eligibility criteria knowledge base by transforming hypernym criteria into machine-readable hyponyms along with their corresponding values. A prototype interface integrating real-world patient information allows us to assess the impact of each eligibility criterion on the number of eligible patients. Leveraging NLP and real-world data in a data-driven approach holds promise for streamlining the overall clinical trial process and improving the efficiency of patient identification.
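One way to picture how the impact of each eligibility criterion on the number of eligible patients might be assessed is a leave-one-out count, sketched below in Python. This is an assumption about the general idea, not a description of the prototype interface: the functions meets, count_eligible, and criterion_impact, the criterion tuples, and the five toy patient records are all hypothetical and do not come from the study data.

```python
import operator

# Comparison operators supported by this sketch.
OPS = {
    ">=": operator.ge,
    "<=": operator.le,
    ">": operator.gt,
    "<": operator.lt,
    "==": operator.eq,
}


def meets(patient, criterion):
    """Check one (attribute, operator, threshold) criterion for one patient.

    A missing attribute is treated as not meeting the criterion, which is
    a simplifying assumption of this sketch.
    """
    attribute, op, threshold = criterion
    value = patient.get(attribute)
    if value is None:
        return False
    return OPS[op](value, threshold)


def count_eligible(patients, criteria):
    """Count patients who satisfy every criterion in the list."""
    return sum(all(meets(p, c) for c in criteria) for p in patients)


def criterion_impact(patients, criteria):
    """Return the baseline count and, for each criterion, the number of
    additional patients who become eligible when that criterion is dropped."""
    baseline = count_eligible(patients, criteria)
    impact = {}
    for i, criterion in enumerate(criteria):
        relaxed = criteria[:i] + criteria[i + 1:]
        impact[criterion] = count_eligible(patients, relaxed) - baseline
    return baseline, impact


# Toy records for illustration only; not real patient data.
patients = [
    {"age": 64, "ecog": 1, "creatinine_clearance": 75.0},
    {"age": 71, "ecog": 2, "creatinine_clearance": 80.0},
    {"age": 58, "ecog": 0, "creatinine_clearance": 45.0},
    {"age": 17, "ecog": 0, "creatinine_clearance": 90.0},
    {"age": 66, "ecog": 3, "creatinine_clearance": 70.0},
]
criteria = [
    ("age", ">=", 18),
    ("ecog", "<=", 1),
    ("creatinine_clearance", ">=", 60.0),
]

baseline, impact = criterion_impact(patients, criteria)

if __name__ == "__main__":
    print("eligible with all criteria:", baseline)
    for criterion, gained in impact.items():
        print(criterion, "-> patients gained if dropped:", gained)
```

Dropping one criterion at a time keeps the comparison simple, but it ignores interactions between criteria; a patient excluded by two criteria is not counted as gained by relaxing either one alone.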

Citation

Please cite as:

Lee K, Lui Z, Mai Y, Higashi MK, Jun T, Ma M, Wang T, Ai L, Calay E, Oh W, Stolovitzky G, Schadt E, Wang X

Optimizing Clinical Trial Eligibility Design Using Natural Language Processing Models and Real-World Data: Algorithm Development and Validation

JMIR AI 2024;3:e50800

DOI: 10.2196/50800

PMID: 39073872

PMCID: 11319878

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

Accepted for/Published in: JMIR AI

Date Submitted: Jul 16, 2023

Open Peer Review Period: Jul 12, 2023 - Sep 6, 2023

Date Accepted: Mar 23, 2024
