Accepted for/Published in: JMIR Formative Research

Date Submitted: Dec 7, 2022
Date Accepted: Apr 17, 2023

The final, peer-reviewed published version of this preprint can be found here:

Gendrin A, Souliotis L, Loudon-Griffiths J, Aggarwal R, Amoako D, Desouza G, Dimitrievska S, Metcalfe P, Louvet E, Sahni H

Identifying Patient Populations in Texts Describing Drug Approvals Through Deep Learning–Based Information Extraction: Development of a Natural Language Processing Algorithm

JMIR Form Res 2023;7:e44876

DOI: 10.2196/44876

PMID: 37347514

PMCID: 10337300

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Categorizing drug approval populations and matching their clinical trials using natural language processing: a practical case study fine-tuning BERT

  • Aline Gendrin
  • Leonidas Souliotis
  • James Loudon-Griffiths
  • Ravisha Aggarwal
  • Daniel Amoako
  • Gregory Desouza
  • Sashka Dimitrievska
  • Paul Metcalfe
  • Emilie Louvet
  • Harpreet Sahni

ABSTRACT

Background:

New drug treatments are approved regularly, and staying up to date in this rapidly changing environment is challenging. Fast, accurate interpretation of new approvals is needed to maintain a global view of the drug market. Automating this information extraction gives subject matter experts a helpful starting point, helps mitigate human error, and saves time.

Objective:

We apply natural language processing (NLP) methods to classify disease populations within the free text of oncology drug approval descriptions from the BioMedTracker database and to extract the clinical trial entities that provide evidence for these approvals.

Methods:

We fine-tune a BERT (Bidirectional Encoder Representations from Transformers) model. This methodology has demonstrated state-of-the-art results on a wide variety of NLP tasks, so we expect its performance to remain stable or improve as the amount of input data increases. BERT’s performance is validated against a rule-based text mining approach.
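
To make the rule-based comparison concrete, the following is a minimal sketch of what a keyword-driven text mining baseline could look like. It is not the authors' implementation: the class names, the keyword patterns, and the function names are all hypothetical, since the abstract does not list the rules or the five classes. The only external fact relied on is that ClinicalTrials.gov identifiers take the form "NCT" followed by eight digits.

```python
import re

# Hypothetical rules: the abstract does not name its classes or keywords.
# The first matching rule wins, so more specific patterns go first.
LINE_OF_THERAPY_RULES = [
    ("first_line", re.compile(
        r"\b(first[- ]line|1L|previously untreated|treatment[- ]naive)\b", re.I)),
    ("second_line_or_later", re.compile(
        r"\b(second[- ]line|2L|previously treated|relapsed|refractory)\b", re.I)),
    ("adjuvant", re.compile(r"\b(neo)?adjuvant\b", re.I)),
]

# ClinicalTrials.gov registry identifiers: "NCT" plus eight digits.
NCT_ID = re.compile(r"\bNCT\d{8}\b")


def classify_line_of_therapy(text: str) -> str:
    """Return the label of the first rule whose pattern occurs in the text."""
    for label, pattern in LINE_OF_THERAPY_RULES:
        if pattern.search(text):
            return label
    return "unknown"


def find_trial_ids(text: str) -> list[str]:
    """Return the distinct NCT identifiers mentioned in the text, sorted."""
    return sorted(set(NCT_ID.findall(text)))


# Invented example sentence, not taken from BioMedTracker.
example = ("Approved for previously untreated metastatic disease "
           "based on study NCT01234567.")
label = classify_line_of_therapy(example)
trial_ids = find_trial_ids(example)
```

A baseline of this kind is cheap to build and easy to audit, but it only recognizes phrasings that were anticipated when the rules were written, which is the usual motivation for comparing it against a fine-tuned model.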

Results:

Using our fine-tuned BERT models, we achieve 5-fold cross-validated accuracies of 61% and 56% for the line of therapy and stage of cancer classification tasks, respectively. With five classes each, this is a marked improvement over random classification (20% expected accuracy under uniform guessing). For the trial identification named entity recognition (NER) task, the 5-fold cross-validated F1 score is currently 87%. The training dataset is small (~400 entries), and scores on both the classification and NER tasks are expected to improve as additional data become available. For clinical validation of the model, the results were corrected by a subject matter expert before use. The subject matter expert then used the results as a helpful starting point for further analysis in a crowded clinical environment such as oncology.
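
The evaluation quantities reported above can be illustrated with a short sketch of 5-fold cross-validation and an F1 score. This is an illustration under stated assumptions rather than the authors' evaluation code: the abstract does not say whether F1 was computed per token or per entity, so exact-match entity spans are assumed here, and all function names, the `train_fn` interface, and the shuffling seed are hypothetical.

```python
import random
from statistics import mean


def k_fold_indices(n_items, k=5, seed=0):
    """Yield (train, test) index lists; each item is held out exactly once."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]


def accuracy(gold, predicted):
    """Fraction of items whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def entity_f1(gold_spans, predicted_spans):
    """Exact-match F1 over entity spans (assumed convention, see lead-in)."""
    gold, pred = set(gold_spans), set(predicted_spans)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def cross_validated_accuracy(texts, labels, train_fn, k=5, seed=0):
    """Average held-out accuracy over k folds.

    train_fn is a hypothetical hook: it takes training texts and labels
    and returns a callable that maps one text to a predicted label.
    """
    scores = []
    for train_idx, test_idx in k_fold_indices(len(texts), k, seed):
        model = train_fn([texts[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        predictions = [model(texts[i]) for i in test_idx]
        scores.append(accuracy([labels[i] for i in test_idx], predictions))
    return mean(scores)
```

With about 400 entries and five folds, each model in this scheme is trained on roughly 320 entries and scored on the remaining 80, which is one reason the reported scores may move noticeably as more data are added.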

Conclusions:

We developed an NLP algorithm that currently assists subject matter experts in extracting the stage of cancer, the line of therapy, and the relevant clinical trials that support these health authority approvals from a free, unstructured text source. The added structure can be used in downstream applications, aiding searchability of relevant content against related drug project sources.


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.