JMIR Preprints #44876: Categorizing drug approval populations and matching their clinical trials using natural language processing: a practical case study fine-tuning BERT


Categorizing drug approval populations and matching their clinical trials using natural language processing: a practical case study fine-tuning BERT

Aline Gendrin;
Leonidas Souliotis;
James Loudon-Griffiths;
Ravisha Aggarwal;
Daniel Amoako;
Gregory Desouza;
Sashka Dimitrievska;
Paul Metcalfe;
Emilie Louvet;
Harpreet Sahni

ABSTRACT

Background:

New drug treatments are regularly approved, and it is challenging to remain up to date in this rapidly changing environment. Fast and accurate extraction of approval information is important for a global understanding of the drug market; automating this extraction provides a helpful starting point for the subject matter expert, helps to mitigate human error, and saves time.

Objective:

We apply natural language processing (NLP) methods to classify disease populations within the free text of oncology drug approval descriptions from the BioMedTracker database, and to extract the clinical trial entities that provide evidence for these approvals.

Methods:

We fine-tune a BERT model, a methodology that has demonstrated state-of-the-art results on a wide variety of NLP tasks. We therefore expect its performance to remain stable or improve over time as the amount of input data increases. BERT’s performance is validated against a rule-based text mining approach.
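As a rough illustration of what a rule-based text mining baseline of this kind could look like, the sketch below assigns a line-of-therapy label from keyword patterns. The label names, the keyword patterns, and the function name are hypothetical; the rules actually used for validation are not described in this abstract.

```python
import re

# Hypothetical keyword rules (assumption): the real rule set and label
# names are not given in the abstract. Rules are checked in order and
# the first match wins.
LINE_OF_THERAPY_RULES = [
    ("first_line", re.compile(
        r"\b(first[- ]line|1st[- ]line|previously untreated|treatment[- ]naive)\b",
        re.IGNORECASE)),
    ("second_line_or_later", re.compile(
        r"\b(second[- ]line|2nd[- ]line|previously treated|relapsed|refractory)\b",
        re.IGNORECASE)),
    ("adjuvant", re.compile(r"\b(neo)?adjuvant\b", re.IGNORECASE)),
]


def classify_line_of_therapy(text: str) -> str:
    """Return the label of the first rule whose pattern occurs in the text."""
    for label, pattern in LINE_OF_THERAPY_RULES:
        if pattern.search(text):
            return label
    return "unknown"


if __name__ == "__main__":
    example = "Approved for patients with relapsed or refractory multiple myeloma."
    print(classify_line_of_therapy(example))
```

A baseline like this is transparent and needs no training data, but it only recognizes the phrasings someone has written a rule for, which is one reason to compare it against a fine-tuned model.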

Results:

Using our fine-tuned BERT models, we achieve 5-fold cross-validated accuracies of 61% and 56% for the line of therapy and stage of cancer classification tasks, respectively; with five classes each, this is a marked increase over random classification (20% if the classes were guessed uniformly at random). For the trial identification named entity recognition (NER) task, the 5-fold cross-validated F1 score is currently 87%. The training dataset is small (~400 entries), and both the classification and NER scores are expected to improve over time as additional data become available. For clinical validation of the model, the results were corrected by a subject matter expert before use. The subject matter expert then used the results as a helpful starting point for further analysis in a crowded clinical environment such as oncology.
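The metrics quoted above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the fold layout (contiguous, unshuffled), the exact-match entity-level definition of F1, and all function names are assumptions, since the abstract does not specify them.

```python
from typing import Iterator, List, Sequence, Set, Tuple

# An entity is represented here as (start_offset, end_offset, label).
# This representation is an assumption made for the example.
Entity = Tuple[int, int, str]


def k_fold_indices(n_items: int, k: int = 5) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists for k contiguous, unshuffled folds."""
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of items whose predicted label equals the gold label."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def chance_accuracy(n_classes: int) -> float:
    """Expected accuracy when guessing uniformly at random among n_classes."""
    return 1.0 / n_classes


def entity_f1(gold: Set[Entity], pred: Set[Entity]) -> float:
    """Exact-match F1: an entity counts only if span and label both match."""
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    folds = list(k_fold_indices(400, k=5))
    print(len(folds), len(folds[0][0]), len(folds[0][1]))
    print(chance_accuracy(5))
```

With roughly 400 entries and five folds, each model is trained on about 320 examples and evaluated on about 80, so per-fold scores can vary noticeably; averaging across folds gives a steadier estimate than any single split.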

Conclusions:

We developed an NLP algorithm that is currently assisting subject matter experts in extracting the stage of cancer, the line of therapy, and the relevant clinical trials that support these Health Authority approvals from a free, unstructured text source. The added structure these results bring can be further utilized in downstream applications, aiding the searchability of relevant content against related drug project sources.

Citation

Please cite as:

Gendrin A, Souliotis L, Loudon-Griffiths J, Aggarwal R, Amoako D, Desouza G, Dimitrievska S, Metcalfe P, Louvet E, Sahni H

Identifying Patient Populations in Texts Describing Drug Approvals Through Deep Learning–Based Information Extraction: Development of a Natural Language Processing Algorithm

JMIR Form Res 2023;7:e44876

DOI: 10.2196/44876

PMID: 37347514

PMCID: 10337300

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

Accepted for/Published in: JMIR Formative Research

Date Submitted: Dec 7, 2022

Date Accepted: Apr 17, 2023
