Accepted for/Published in: JMIR AI
Date Submitted: Jul 16, 2023
Open Peer Review Period: Jul 12, 2023 - Sep 6, 2023
Date Accepted: Mar 23, 2024
Empowering Clinical Trials with Natural Language Processing Models and Real-World Data: A Feasibility Study to Optimize Clinical Trial Eligibility Design with Data-driven Simulations
ABSTRACT
Background:
Clinical trials are vital for developing new therapies but can also delay drug development. Efficient trial data management, optimized trial protocols, and accurate patient identification are critical for reducing trial timelines. Natural language processing (NLP) shows potential for achieving these objectives.
Objective:
To assess the feasibility of using data-driven approaches to optimize clinical trial protocol design and identify eligible patients. This involves creating a comprehensive eligibility criteria knowledge base integrated within electronic health records (EHRs) using deep-learning-based NLP technologies.
Methods:
We obtained 3,281 industry-sponsored, interventional phase 2 or 3 clinical trials from ClinicalTrials.gov, spanning 2013 to 2020, that recruited patients with non-small cell lung cancer, prostate cancer, breast cancer, multiple myeloma, ulcerative colitis, or Crohn's disease. A customized NLP pipeline based on bidirectional long short-term memory (BiLSTM) and conditional random field (CRF) models was used to extract all eligibility criteria attributes and convert hypernym concepts into computable hyponyms with their corresponding values. To illustrate how clinical trial design can be simulated for optimization purposes, we selected as a pilot study a subset of patients with non-small cell lung cancer (N=2,775) curated from Mount Sinai Healthcare System.
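The hypernym-to-hyponym conversion described above can be pictured with a minimal Python sketch. It is an illustration only and not the authors' pipeline or knowledge base: the `HYPONYMS` table, the function names `expand_concept` and `patient_matches`, and the two example drug classes are hypothetical, chosen to show how a broad criterion term might be expanded into specific terms that can be matched against a medication list in an EHR.

```python
from typing import Dict, List

# Hypothetical lookup table (not from the paper): each broad "hypernym"
# concept maps to specific "hyponym" terms that could appear in an EHR.
HYPONYMS: Dict[str, List[str]] = {
    "platinum-based chemotherapy": ["cisplatin", "carboplatin", "oxaliplatin"],
    "egfr tyrosine kinase inhibitor": [
        "gefitinib", "erlotinib", "afatinib", "osimertinib",
    ],
}


def expand_concept(concept: str) -> List[str]:
    """Return the specific terms for a concept, or the concept itself
    (lowercased) when no expansion is known."""
    key = concept.strip().lower()
    return HYPONYMS.get(key, [key])


def patient_matches(concept: str, medications: List[str]) -> bool:
    """True if any recorded medication is one of the concept's hyponyms."""
    terms = set(expand_concept(concept))
    return any(m.strip().lower() in terms for m in medications)


if __name__ == "__main__":
    print(expand_concept("Platinum-based chemotherapy"))
    print(patient_matches("EGFR tyrosine kinase inhibitor", ["Osimertinib"]))
```

A real knowledge base would presumably draw such expansions from a curated ontology rather than a hand-written dictionary; the dictionary here only keeps the sketch self-contained.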
Results:
We manually annotated the clinical trial eligibility corpus (N=485 trials) and constructed an eligibility criteria-specific ontology. Our customized NLP pipeline, built on this ontology, achieved high precision (0.91), recall (0.79), and F1-score (0.83), enabling efficient extraction of granular criteria entities and relevant attributes from 3,281 clinical trials. A standardized eligibility criteria knowledge base, compatible with EHRs, was developed by transforming hypernym concepts into machine-interpretable hyponyms along with their corresponding values. Additionally, an interface prototype demonstrated the practicality of leveraging real-world data for optimizing clinical trial protocols and identifying eligible patients.
Conclusions:
Our customized NLP pipeline successfully generated a standardized eligibility criteria knowledge base by transforming hypernym criteria into machine-readable hyponyms along with their corresponding values. A prototype interface integrating real-world patient information allows us to assess the impact of each eligibility criterion on the number of eligible patients. Leveraging NLP and real-world data in a data-driven approach holds promise for streamlining the overall clinical trial process and improving efficiency in patient identification.
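One simple way to picture "the impact of each eligibility criterion on the number of eligible patients" is a leave-one-out count: compute how many patients satisfy all criteria, then recompute with each criterion removed in turn. The sketch below is an assumption about how such a simulation could work, not a description of the authors' prototype; the `Patient` fields, the three criteria, and the toy cohort are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Patient:
    # Hypothetical structured fields; a real EHR extract would have many more.
    age: int
    ecog: int
    egfr_mutation: bool


Criterion = Callable[[Patient], bool]


def count_eligible(patients: List[Patient], criteria: Dict[str, Criterion]) -> int:
    """Number of patients who satisfy every criterion."""
    return sum(all(check(p) for check in criteria.values()) for p in patients)


def criterion_impact(
    patients: List[Patient], criteria: Dict[str, Criterion]
) -> Dict[str, int]:
    """For each criterion, how many additional patients become eligible
    when that single criterion is dropped."""
    baseline = count_eligible(patients, criteria)
    impact = {}
    for name in criteria:
        relaxed = {k: v for k, v in criteria.items() if k != name}
        impact[name] = count_eligible(patients, relaxed) - baseline
    return impact


# Toy cohort and criteria, invented for illustration only.
patients = [
    Patient(age=64, ecog=0, egfr_mutation=True),
    Patient(age=71, ecog=2, egfr_mutation=True),
    Patient(age=58, ecog=1, egfr_mutation=False),
    Patient(age=80, ecog=1, egfr_mutation=True),
    Patient(age=78, ecog=0, egfr_mutation=True),
]
criteria: Dict[str, Criterion] = {
    "age_18_to_75": lambda p: 18 <= p.age <= 75,
    "ecog_0_or_1": lambda p: p.ecog <= 1,
    "egfr_mutation_positive": lambda p: p.egfr_mutation,
}

if __name__ == "__main__":
    print(count_eligible(patients, criteria))
    print(criterion_impact(patients, criteria))
```

In this toy cohort the age limit excludes the most otherwise-eligible patients, which is the kind of signal a protocol designer might use when deciding which criterion to relax.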
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.