Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Jun 15, 2023
Open Peer Review Period: Jun 15, 2023 - Jun 30, 2023
Date Accepted: Mar 1, 2024
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
An NLP System for COVID/PASC: A Case Demonstration of the OHNLP Toolkit from the National COVID Cohort Collaborative and the RECOVER programs
ABSTRACT
Background:
A wealth of clinically relevant information is obtainable only from unstructured clinical narratives, leading to great interest in clinical natural language processing (NLP). While a multitude of approaches to NLP exist, current algorithm development approaches have limitations that can slow the development process. These limitations are exacerbated when the task is novel or evolving, as is currently the case for NLP extraction of signs and symptoms of COVID-19 and post-acute sequelae of SARS-CoV-2 infection (PASC).
Objective:
To highlight current limitations of existing NLP algorithm development approaches that are exacerbated by NLP tasks surrounding novel and evolving clinical concepts, and to illustrate our approach to addressing these issues through the use case of developing an NLP system for signs and symptoms of COVID-19 and PASC.
Methods:
Two pre-existing studies on PASC were used as a baseline to determine the set of concepts that NLP should extract. This concept list was then used in conjunction with the Unified Medical Language System (UMLS) to automatically generate an expanded lexicon, which was used to weakly annotate a training set. The weakly annotated training set was then reviewed by a human expert to produce a fine-tuned NLP algorithm. The annotations from a fully human-annotated test set were then compared with NLP results from the fine-tuned algorithm. The NLP algorithm was then deployed to 10 additional sites also running our NLP infrastructure. Of these 10 sites, 5 were used to conduct a federated evaluation of the NLP algorithm.
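As an illustration of the dictionary-based weak annotation step described above, the following is a minimal sketch and not the OHNLP toolkit's implementation. The lexicon entries, the placeholder concept identifiers, and the function names `normalize` and `weak_annotate` are all hypothetical; a real lexicon would be derived from UMLS.

```python
import re

# Hypothetical toy lexicon: normalized text string -> placeholder concept ID.
# These entries are illustrative assumptions, not taken from the study.
LEXICON = {
    "cough": "CONCEPT_COUGH",
    "shortness of breath": "CONCEPT_DYSPNEA",
    "dyspnea": "CONCEPT_DYSPNEA",
    "loss of smell": "CONCEPT_ANOSMIA",
}


def normalize(text):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    lowered = text.lower()
    no_punct = re.sub(r"[^a-z0-9\s]", " ", lowered)
    return re.sub(r"\s+", " ", no_punct).strip()


def weak_annotate(text, lexicon):
    """Return (start, end, matched string, concept ID) for each lexicon hit.

    Offsets refer to the normalized text, not the raw note.
    """
    norm = normalize(text)
    # Try longer terms first so multi-word entries win over shorter ones.
    terms = sorted(lexicon, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b")
    return [
        (m.start(), m.end(), m.group(1), lexicon[m.group(1)])
        for m in pattern.finditer(norm)
    ]


note = "Pt reports Cough and shortness of  breath; denies loss of smell."
concepts = [concept for _, _, _, concept in weak_annotate(note, LEXICON)]
```

Note that this sketch labels "loss of smell" even though the note negates it. Plain string matching has no notion of negation or context, which is one reason weak labels of this kind benefit from expert review.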
Results:
An NLP algorithm consisting of 12,234 unique normalized text strings corresponding to 2,366 unique concepts was developed to extract COVID-19/PASC signs and symptoms. An unweighted mean dictionary coverage of 77.8% was found across the 5 sites used for the federated evaluation.
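To make the term "unweighted mean" concrete, the sketch below contrasts it with a pooled (size-weighted) figure. The per-site counts are hypothetical and are not the study's data, and the abstract does not define how per-site coverage is computed, so the (covered, total) pairs are an assumption made only for illustration.

```python
from statistics import mean

# Hypothetical (covered, total) counts per site; NOT values from the study.
site_counts = {
    "site_a": (80, 100),
    "site_b": (30, 50),
    "site_c": (900, 1000),
}


def unweighted_mean_coverage(counts):
    """Average the per-site coverage fractions, giving each site equal weight."""
    return mean(covered / total for covered, total in counts.values())


def pooled_coverage(counts):
    """Pool all sites together, so larger sites dominate the result."""
    covered = sum(c for c, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return covered / total


unweighted = unweighted_mean_coverage(site_counts)
pooled = pooled_coverage(site_counts)
```

With these made-up counts, the unweighted mean is lower than the pooled figure because the largest site also has the highest coverage. Reporting the unweighted mean keeps a single large site from dominating the summary.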
Conclusions:
The evolving and time-critical nature of the PASC NLP task significantly complicates existing approaches to NLP algorithm development. In this work, we present a hybrid approach utilizing the Open Health Natural Language Processing (OHNLP) toolkit that aims to address these needs: a dictionary-based weak labeling step minimizes the need for additional expert annotation while still preserving the fine-tuning capabilities of expert involvement.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.