Accepted for/Published in: JMIR Medical Informatics
Date Submitted: May 29, 2020
Date Accepted: Oct 4, 2020
Date Submitted to PubMed: May 29, 2020
Development and Validation of a Natural Language Processing Algorithm for Surveillance of Cervical and Anal Cancer and Precancer: A Split-validation Study
ABSTRACT
Background:
Accurate identification of new diagnoses of human papillomavirus (HPV)-associated cancers and precancers is an important step toward developing strategies that optimize the use of HPV vaccines. The diagnosis of HPV cancers hinges on a histopathologic report, which is typically stored in the electronic medical record (EMR) as free-form, or unstructured, narrative text. Previous efforts to perform surveillance for HPV cancers have relied on manual review of pathology reports to extract diagnostic information, a process that is both labor- and resource-intensive. Natural language processing (NLP) can automate the structuring and extraction of clinical data from unstructured narrative text in medical records and may provide a practical and effective method for identifying patients with HPV vaccine-preventable disease for surveillance and research.
Objective:
The objective of this study was to develop and assess the accuracy of an NLP algorithm for identifying individuals with cancer or precancer of the cervix and anus.
Methods:
We developed a pipeline-based NLP algorithm that incorporated both machine-learning and rule-based methods to extract diagnostic elements from narrative pathology reports. To test the algorithm’s classification accuracy, we used a split-validation study design. Full-length cervical and anal pathology reports were randomly selected from 4 clinical pathology laboratories. Two study team members, blinded to the classifications produced by the NLP algorithm, manually and independently reviewed all reports and classified them at the document level according to two domains (diagnosis and HPV testing results). Using the manual review as the “gold standard”, we evaluated the algorithm’s performance with standard measurements of accuracy (true positives and true negatives / total number of reports), recall (true positives / positive reports by gold standard), precision (true positives / positive reports by NLP), and f-measure (harmonic mean of recall and precision).
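The four measurements defined above can be sketched in a few lines of Python. This is a minimal illustration, not the study's code: the names `Confusion`, `tally`, and `metrics` are hypothetical, and the labels are assumed to be document-level booleans (True = abnormal or positive) from the manual review and from the NLP algorithm.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Document-level counts of agreement between gold standard and NLP."""
    tp: int  # positive by gold standard and by NLP
    fp: int  # negative by gold standard, positive by NLP
    tn: int  # negative by both
    fn: int  # positive by gold standard, negative by NLP


def tally(gold, predicted):
    """Count TP/FP/TN/FN from two equal-length sequences of booleans."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    tp = fp = tn = fn = 0
    for g, p in zip(gold, predicted):
        if g and p:
            tp += 1
        elif p:
            fp += 1
        elif g:
            fn += 1
        else:
            tn += 1
    return Confusion(tp, fp, tn, fn)


def _ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else math.nan


def metrics(c):
    """Accuracy, recall, precision, and f-measure as defined in the Methods."""
    accuracy = _ratio(c.tp + c.tn, c.tp + c.fp + c.tn + c.fn)
    recall = _ratio(c.tp, c.tp + c.fn)      # TP / positives by gold standard
    precision = _ratio(c.tp, c.tp + c.fp)   # TP / positives by NLP
    # Harmonic mean of recall and precision.
    f_measure = _ratio(2 * precision * recall, precision + recall)
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "f_measure": f_measure,
    }
```

Calling `metrics(tally(gold_labels, nlp_labels))` on the labels for one specimen type would give one row of such an evaluation; undefined ratios (for example, precision when the algorithm flags no reports) are returned as NaN rather than raising an error.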
Results:
The NLP algorithm’s performance was validated on 949 pathology reports (anal cytology = 105, anal histology = 95, cervical cytology = 449, and cervical histology = 404). The NLP algorithm demonstrated accurate identification of abnormal cytology, histology, and positive HPV tests with accuracies greater than 0.91 in all specimens. Precision (also known as positive predictive value) was lowest for anal histology reports (0.87; 95% CI = 0.59–0.98) and highest for cervical cytology (0.98; 95% CI = 0.95–0.99). The NLP algorithm missed two out of the 15 abnormal anal histology reports, which led to the relatively low recall/sensitivity (0.68; 95% CI = 0.43–0.87).
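The width of these intervals reflects how few abnormal reports some specimen types contributed. As a rough illustration, a 95% confidence interval for a proportion such as precision or recall can be computed as follows. The abstract does not say which interval method the authors used, so the Wilson score interval here is an assumption chosen because it needs only the standard library; the function name `wilson_interval` and the counts in the usage note are hypothetical.

```python
import math


def wilson_interval(successes, total, z=1.959964):
    """Wilson score interval for a binomial proportion.

    z defaults to the standard normal quantile for a two-sided 95% interval.
    Assumption: the source does not state its interval method; this is one
    common choice, not necessarily the one used in the study.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= successes <= total:
        raise ValueError("successes must be between 0 and total")
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the extremes.
    return max(0.0, center - half_width), min(1.0, center + half_width)
```

For example, `wilson_interval(45, 50)` returns the lower and upper bounds for a hypothetical 45 true positives among 50 reports flagged by the algorithm. Exact (Clopper-Pearson) intervals are another common choice and generally give somewhat wider bounds for small samples.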
Conclusions:
This study outlines the development and validity testing of a freely available and easily implementable NLP algorithm that is able to automate the extraction and classification of clinical data from cervical and anal cytology and histology reports with high accuracy.
Clinical Trial: N/A
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.