Accepted for/Published in: JMIR Medical Informatics

Date Submitted: Jun 15, 2023
Open Peer Review Period: Jun 15, 2023 - Jun 30, 2023
Date Accepted: Mar 1, 2024

The final, peer-reviewed published version of this preprint can be found here:

A Case Demonstration of the Open Health Natural Language Processing Toolkit From the National COVID-19 Cohort Collaborative and the Researching COVID to Enhance Recovery Programs for a Natural Language Processing System for COVID-19 or Postacute Sequelae of SARS CoV-2 Infection: Algorithm Development and Validation

Wen A, Wang L, He H, Fu S, Liu S, Hanauer DA, Harris DR, Kavuluru R, Zhang R, Natarajan K, Pavinkurve NP, Hajagos J, Rajupet S, Lingam V, Saltz M, Elowsky C, Moffitt RA, Koraishy FM, Palchuk MB, Donovan J, Lingrey L, Stone-DerHargopian G, Miller RT, Williams AE, Leese PJ, Kovach PI, Pfaff ER, Zemmel M, Pates RD, Guthe N, Haendel MA, Chute CG, Liu H, National COVID Cohort Collaborative, the RECOVER Initiative

A Case Demonstration of the Open Health Natural Language Processing Toolkit From the National COVID-19 Cohort Collaborative and the Researching COVID to Enhance Recovery Programs for a Natural Language Processing System for COVID-19 or Postacute Sequelae of SARS CoV-2 Infection: Algorithm Development and Validation

JMIR Med Inform 2024;12:e49997

DOI: 10.2196/49997

PMID: 39250782

PMCID: 11420592

An NLP System for COVID/PASC: A Case Demonstration of the OHNLP Toolkit from the National COVID Cohort Collaborative and the RECOVER programs

  • Andrew Wen
  • Liwei Wang
  • Huan He
  • Sunyang Fu
  • Sijia Liu
  • David A Hanauer
  • Daniel R Harris
  • Ramakanth Kavuluru
  • Rui Zhang
  • Karthik Natarajan
  • Nishanth P Pavinkurve
  • Janos Hajagos
  • Sritha Rajupet
  • Veena Lingam
  • Mary Saltz
  • Corey Elowsky
  • Richard A Moffitt
  • Farrukh M Koraishy
  • Matvey B Palchuk
  • Jordan Donovan
  • Lora Lingrey
  • Garo Stone-DerHargopian
  • Robert T Miller
  • Andrew E Williams
  • Peter J Leese
  • Paul I Kovach
  • Emily R Pfaff
  • Mikhail Zemmel
  • Robert D Pates
  • Nick Guthe
  • Melissa A Haendel
  • Christopher G Chute
  • Hongfang Liu
  • National COVID Cohort Collaborative
  • the RECOVER Initiative

ABSTRACT

Background:

A wealth of clinically relevant information is available only in unstructured clinical narratives, which has led to great interest in clinical natural language processing (NLP). While many approaches to NLP exist, current algorithm development approaches have limitations that can slow the development process. These limitations are exacerbated when the task is novel or evolving, as is currently the case for NLP extraction of signs and symptoms of COVID-19 and post-acute sequelae of SARS-CoV-2 infection (PASC).

Objective:

To highlight current limitations of existing NLP algorithm development approaches that are exacerbated by NLP tasks surrounding novel and evolving clinical concepts, and to illustrate our approach to addressing these issues through the use case of developing an NLP system for signs and symptoms of COVID-19 and PASC.

Methods:

Two pre-existing studies on PASC were used as a baseline to determine a set of concepts that should be extracted by NLP. This concept list was then used in conjunction with the Unified Medical Language System (UMLS) to automatically generate an expanded lexicon, which was used to weakly annotate a training set. The weakly annotated training set was then reviewed by a human expert to produce a fine-tuned NLP algorithm. Annotations from a fully human-annotated test set were then compared with the NLP results from the fine-tuned algorithm. The NLP algorithm was then deployed to 10 additional sites that were also running our NLP infrastructure. Of these 10 sites, 5 were used to conduct a federated evaluation of the NLP algorithm.
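
The dictionary-based weak annotation step described above can be illustrated with a minimal Python sketch. This is not the OHNLP toolkit's implementation or API: the lexicon entries, the concept identifiers (placeholders, not real UMLS CUIs), and the function names `build_matcher` and `weak_label` are all assumptions made for illustration only.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical lexicon: normalized text string -> concept identifier.
# The identifiers are placeholders, not real UMLS CUIs.
LEXICON: Dict[str, str] = {
    "shortness of breath": "CONCEPT_DYSPNEA",
    "dyspnea": "CONCEPT_DYSPNEA",
    "fatigue": "CONCEPT_FATIGUE",
    "loss of smell": "CONCEPT_ANOSMIA",
}


def build_matcher(lexicon: Dict[str, str]) -> "re.Pattern[str]":
    """Compile one case-insensitive pattern covering every lexicon string."""
    # Longest strings first so multi-word terms win over shorter ones.
    terms = sorted(lexicon, key=len, reverse=True)
    # Allow any run of whitespace between the tokens of a multi-word term.
    alternatives = [
        r"\s+".join(re.escape(token) for token in term.split())
        for term in terms
    ]
    pattern = "|".join(alternatives)
    return re.compile(rf"\b(?:{pattern})\b", re.IGNORECASE)


def weak_label(
    text: str, lexicon: Dict[str, str], matcher: "re.Pattern[str]"
) -> List[Tuple[int, int, str]]:
    """Return (start, end, concept) spans for every lexicon match in text."""
    annotations = []
    for match in matcher.finditer(text):
        # Normalize the matched surface form back to its lexicon key.
        key = " ".join(match.group(0).lower().split())
        annotations.append((match.start(), match.end(), lexicon[key]))
    return annotations


note = "Patient reports Shortness of  breath and fatigue."
matcher = build_matcher(LEXICON)
for start, end, concept in weak_label(note, LEXICON, matcher):
    print(start, end, repr(note[start:end]), concept)
```

In a workflow like the one described, spans produced this way would serve only as weak labels; the expert review step would then add, remove, or correct lexicon entries rather than annotate every document from scratch.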

Results:

An NLP algorithm consisting of 12,234 unique normalized text strings corresponding to 2,366 unique concepts was developed to extract COVID-19/PASC signs and symptoms. An unweighted mean dictionary coverage of 77.8% was found across the 5 sites used for the federated evaluation.
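
To make the term "unweighted mean" concrete, the following Python sketch contrasts it with a weighted mean. The per-site values and weights are invented placeholders, not the study's data, and the function names are hypothetical. The abstract does not define how per-site coverage was computed, so the sketch takes per-site coverage as a given input.

```python
from statistics import fmean
from typing import Dict


def unweighted_mean(coverage_by_site: Dict[str, float]) -> float:
    """Every site counts equally, regardless of how much data it holds."""
    return fmean(coverage_by_site.values())


def weighted_mean(
    coverage_by_site: Dict[str, float], weight_by_site: Dict[str, float]
) -> float:
    """Each site's coverage is scaled by a weight, e.g. its document count."""
    total_weight = sum(weight_by_site[site] for site in coverage_by_site)
    weighted_sum = sum(
        coverage_by_site[site] * weight_by_site[site]
        for site in coverage_by_site
    )
    return weighted_sum / total_weight


# Placeholder values for illustration only; these are NOT the study's results.
coverage = {"site_a": 0.5, "site_b": 1.0}
documents = {"site_a": 3, "site_b": 1}

print(unweighted_mean(coverage))           # 0.75
print(weighted_mean(coverage, documents))  # 0.625
```

An unweighted mean keeps a single large site from dominating the summary, at the cost of giving small sites the same influence as large ones.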

Conclusions:

The evolving and time-critical nature of the PASC NLP task significantly complicates existing approaches to NLP algorithm development. In this work, we present a hybrid approach using the Open Health Natural Language Processing (OHNLP) toolkit that aims to address these needs: a dictionary-based weak labeling step minimizes the need for additional expert annotation while preserving the fine-tuning capabilities of expert involvement.


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.