Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Jun 14, 2023
Date Accepted: Aug 23, 2023
Bridging the Gap: A Multidisciplinary Approach to Developing and Validating a Natural Language Processing Model for COVID-19 Detection Based on Dutch General Practice Electronic Health Records Using Bidirectional Encoder Representations from Transformers (BERT)
ABSTRACT
Background:
Natural language processing (NLP) models, such as bidirectional encoder representations from transformers (BERT), hold promise for revolutionizing disease identification from electronic health records (EHRs) by potentially enhancing efficiency and accuracy. However, their application in clinical practice demands a comprehensive and multidisciplinary approach to development and validation. The COVID-19 pandemic highlighted how limited testing availability and the difficulty of handling unstructured data hamper disease identification. In the Netherlands, where general practitioners (GPs) serve as the first point of contact for healthcare, the EHR data generated by these primary care providers contain a wealth of potentially valuable information. Nonetheless, the unstructured nature of free-text entries in EHRs makes it hard to identify trends, detect disease outbreaks, or accurately pinpoint COVID-19 cases.
Objective:
This study aimed to develop and validate a BERT model for detecting COVID-19 consultations in general practice EHRs in the Netherlands.
Methods:
The BERT model was initially pre-trained on Dutch language data and then fine-tuned using a comprehensive EHR dataset comprising confirmed COVID-19 GP consultations and non-COVID-19-related consultations. The dataset was partitioned into training and development sets, and the model’s performance was evaluated on an independent test set, which served as the primary measure of its effectiveness in COVID-19 detection. To validate the final model, its performance was assessed through three approaches. First, external validation was performed on an EHR dataset from a different geographic region in the Netherlands. Second, validation was conducted using polymerase chain reaction (PCR) test results obtained from municipal health services. Lastly, the correlation between predicted outcomes and COVID-19-related hospitalizations in the Netherlands was assessed for the period around the outbreak of the pandemic in the Netherlands, i.e., the period before widespread testing.
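The third validation approach, relating model predictions to hospitalizations, can be illustrated with an ordinary least-squares fit. The following is a minimal sketch and not the authors' analysis: the weekly counts are invented, the function name `simple_linear_regression` is hypothetical, and the abstract does not state how predictions were aggregated or which regression was used.

```python
import math


def simple_linear_regression(x, y):
    """Fit y = intercept + slope * x by ordinary least squares.

    Returns (slope, intercept, r_squared, f_statistic).
    Assumes len(x) == len(y) >= 3 and that neither series is constant.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Sums of squares and cross-products around the means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # For a single predictor, R-squared equals the squared Pearson correlation.
    r_squared = (sxy ** 2) / (sxx * syy)

    # F statistic of a one-predictor regression with n - 2 residual degrees of freedom.
    if r_squared >= 1.0:
        f_statistic = math.inf
    else:
        f_statistic = r_squared / (1.0 - r_squared) * (n - 2)

    return slope, intercept, r_squared, f_statistic


# Hypothetical weekly counts (illustrative only, not data from the study).
predicted_covid_consultations = [2, 5, 9, 14, 20, 26]
covid_hospitalizations = [1, 3, 8, 12, 21, 24]

slope, intercept, r_squared, f_statistic = simple_linear_regression(
    predicted_covid_consultations, covid_hospitalizations
)
print(f"slope={slope:.2f} intercept={intercept:.2f} "
      f"R^2={r_squared:.2f} F={f_statistic:.1f}")
```

In a one-predictor regression the F statistic is fully determined by R-squared and the number of observations, so reporting both conveys how many time points entered the fit.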
Results:
Model development used 300,359 general practitioner consultations. We developed a highly accurate model for COVID-19 consultations (accuracy 0.97, F1 score 0.90, precision 0.85, recall 0.85, specificity 0.99). External validation showed comparably high performance. Validation on PCR test data showed high recall but lower precision and specificity. Validation using hospital data showed a significant correlation between the model's COVID-19 predictions and COVID-19-related hospitalizations (F1 score: 96.8, P < 0.001, R-squared: 0.69). Most importantly, the model was able to predict COVID-19 cases weeks before the first confirmed case in the Netherlands.
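For readers less familiar with the reported metrics, the sketch below shows how accuracy, precision, recall, specificity, and F1 score follow from the four cells of a binary confusion matrix. It is illustrative only: the counts are invented and do not come from the study, and the names `ConfusionCounts` and `classification_metrics` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Cells of a binary confusion matrix (positive class = COVID-19 consultation)."""
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives


def classification_metrics(c: ConfusionCounts) -> dict:
    """Compute standard binary classification metrics.

    Assumes every denominator is non-zero, i.e. there is at least one actual
    positive, one actual negative, and one predicted positive.
    """
    total = c.tp + c.fp + c.tn + c.fn
    precision = c.tp / (c.tp + c.fp)    # share of predicted positives that are correct
    recall = c.tp / (c.tp + c.fn)       # share of actual positives that are found
    specificity = c.tn / (c.tn + c.fp)  # share of actual negatives that are found
    accuracy = (c.tp + c.tn) / total
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts for 1,000 consultations (illustrative only).
counts = ConfusionCounts(tp=80, fp=20, tn=880, fn=20)
metrics = classification_metrics(counts)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Because the F1 score is the harmonic mean of precision and recall, it always lies between the two values and equals them when they coincide.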
Conclusions:
The developed BERT model accurately identified COVID-19 cases among general practitioner consultations, even before the first cases were confirmed. The validated efficacy of our BERT model highlights the potential of NLP models to identify disease outbreaks early, exemplifying the power of multidisciplinary efforts in harnessing technology for disease identification. Moreover, the implications of this study extend beyond COVID-19 and offer a blueprint for the early recognition of various illnesses, suggesting that such models could revolutionize disease surveillance.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.