JMIR Preprints #27210: caBERTnet: A Question-and-Answer System to Extract Data from Free-Text Oncological Pathology Reports


caBERTnet: A Question-and-Answer System to Extract Data from Free-Text Oncological Pathology Reports

Joseph Ross Mitchell;
Phillip Szepietowski;
Rachel Howard;
Phillip Reisman;
Jennie D. Jones;
Patricia Lewis;
Brooke L. Fridley;
Dana E. Rollison

ABSTRACT

Background:

Information in pathology reports is critical for cancer care. Natural language processing (NLP) systems to extract information from pathology reports are often narrow in scope or require extensive tuning. Consequently, there is growing interest in automated deep learning approaches. A powerful new NLP algorithm, Bidirectional Encoder Representations from Transformers (BERT), was published in late 2018. BERT set new performance standards on tasks as diverse as question-answering, named entity recognition, speech recognition, and more.

Objective:

To develop a BERT-based system to automatically extract detailed tumor site and histology information from free-text pathology reports.

Methods:

We pursued three specific aims: 1) extract accurate tumor site and histology descriptions from free-text pathology reports; 2) accommodate the diverse terminology used to indicate the same pathology; and 3) provide accurate standardized tumor site and histology codes for use by downstream applications. We first trained a base language model to comprehend the technical language in pathology reports. This involved unsupervised learning on a training corpus of 275,605 electronic pathology reports from 164,531 unique patients that included 121 million words. Next, we trained a Q&A “head” that would connect to, and work with, the pathology language model to answer pathology questions. Our Q&A system was designed to search for the answers to two predefined questions in each pathology report: 1) “What organ contains the tumor?”; and 2) “What is the kind of tumor or carcinoma?”. This involved supervised training on 8,197 pathology reports, each with ground truth answers to these two questions determined by Certified Tumor Registrars. The dataset included 214 tumor sites and 193 histologies. The tumor site and histology phrases extracted by the Q&A model were then used to predict International Classification of Diseases for Oncology, Third Edition (ICD-O-3) site and histology codes. This involved fine-tuning two additional BERT models: one to predict site codes and a second to predict histology codes. Our final system is a network of 3 BERT-based models, which we call caBERTnet (pronounced “Cabernet”). We evaluated caBERTnet using a sequestered test dataset of 2,050 pathology reports with ground truth answers determined by Certified Tumor Registrars.

Results:

caBERTnet’s accuracies for predicting group-level site and histology codes were 93.5% and 97.7%, respectively. The top-5 accuracies for predicting fine-grained ICD-O-3 site and histology codes with 5 or more samples each in the training dataset were 93.6% and 95.4%, respectively.
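
For readers unfamiliar with the metric, the sketch below shows one way a top-5 accuracy restricted to codes with 5 or more training samples could be computed. It is an illustration under stated assumptions, not the authors' evaluation code: the function name top_k_accuracy, the choice to filter test cases by how often their true code appears in the training labels, and the toy labels are all hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def top_k_accuracy(ranked_predictions: Sequence[Sequence[str]],
                   true_labels: Sequence[str],
                   train_labels: Sequence[str],
                   k: int = 5,
                   min_train_samples: int = 5) -> float:
    """Fraction of eligible test cases whose true code is in the top k predictions.

    A test case is eligible only if its true code occurs at least
    min_train_samples times in train_labels (an assumed reading of the
    "5 or more samples each in the training dataset" condition).
    """
    train_counts = Counter(train_labels)
    hits = 0
    total = 0
    for ranked, truth in zip(ranked_predictions, true_labels):
        if train_counts[truth] < min_train_samples:
            continue  # code too rare in training; excluded from the metric
        total += 1
        if truth in ranked[:k]:
            hits += 1
    return hits / total if total else float("nan")


# Toy data for illustration only (not from the study).
train_labels = ["C50.9"] * 6 + ["C34.1"] * 5 + ["C18.7"] * 2
ranked_predictions = [
    ["C50.9", "C50.4"],  # true code ranked first: hit
    ["C34.3", "C34.9"],  # true code absent: miss
    ["C18.7"],           # true code has only 2 training samples: excluded
]
true_labels = ["C50.9", "C34.1", "C18.7"]

accuracy = top_k_accuracy(ranked_predictions, true_labels, train_labels)
print(accuracy)
```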

Conclusions:

This is the first time an NLP system has achieved expert-level performance predicting ICD-O-3 codes across a broad range of tumor sites and histologies. Our new system could help reduce treatment delays, increase enrollment in clinical trials of new therapies, and improve patient outcomes.

Citation

Please cite as:

Mitchell JR, Szepietowski P, Howard R, Reisman P, Jones JD, Lewis P, Fridley BL, Rollison DE

A Question-and-Answer System to Extract Data From Free-Text Oncological Pathology Reports (CancerBERT Network): Development Study

J Med Internet Res 2022;24(3):e27210

DOI: 10.2196/27210

PMID: 35319481

PMCID: 8987958


Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.


Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Jan 16, 2021

Open Peer Review Period: Jan 15, 2021 - Mar 12, 2021

Date Accepted: Nov 10, 2021

