JMIR Preprints #37840: Extracting Multiple Worries from Breast Cancer Patient Blogs Using Multi-Label Classification With BERT: a Natural Language-Processing Model

Tomomi Watanabe;
Shuntaro Yada;
Eiji Aramaki;
Hiroshi Yajima;
Hayato Kizaki;
Satoko Hori

ABSTRACT

Background:

Breast cancer patients have a variety of worries and need multifaceted information support. Their accumulated posts on social media contain rich descriptions of their daily worries concerning issues such as treatment, family, and finances. Identifying these issues is important for helping breast cancer patients resolve their worries and obtain reliable information.

Objective:

This study aimed to extract and classify multiple worries from breast cancer patient-generated text using bidirectional encoder representations from transformers (BERT), a context-aware natural language processing model.

Methods:

A total of 2,272 blog posts by breast cancer patients in Japan were collected. Five worry labels were defined: “treatment”, “physical”, “psychological”, “work/financial”, and “family/friends”. Each post could be assigned multiple labels. To assess the label criteria, 50 blog posts were randomly selected and annotated by two researchers with medical knowledge. After inter-annotator agreement (IAA) was assessed with Cohen’s kappa, one researcher annotated all the blog posts. A multi-label classifier that simultaneously predicts the five worries in a text was developed using BERT. The classifier was fine-tuned by using the posts as input and adding a classification layer to the pre-trained BERT. Performance was evaluated in terms of precision, averaged over the results of five-fold cross-validation.
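As an illustration of the agreement check described above, the following is a minimal sketch of per-label Cohen's kappa for binary multi-label annotations. It is not the authors' code: the function names (`to_multi_hot`, `cohen_kappa`, `per_label_kappa`) are hypothetical, and returning 1.0 when both annotators use a single identical class throughout is a convention chosen here, since kappa is undefined in that case.

```python
from typing import Dict, List, Sequence

# The five worry labels named in the abstract; the order is an arbitrary choice.
LABELS = ["treatment", "physical", "psychological", "work/financial", "family/friends"]


def to_multi_hot(assigned: Sequence[str]) -> List[int]:
    """Encode the labels assigned to one post as a 0/1 vector in LABELS order."""
    unknown = set(assigned) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if label in assigned else 0 for label in LABELS]


def cohen_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa for two annotators' binary decisions on the same posts."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("annotations must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a = sum(a) / n  # share of posts annotator A marked positive
    p_b = sum(b) / n  # share of posts annotator B marked positive
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)  # agreement expected by chance
    if expected == 1.0:
        # Assumption: kappa is undefined here; report perfect agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


def per_label_kappa(
    ann_a: Sequence[Sequence[int]], ann_b: Sequence[Sequence[int]]
) -> Dict[str, float]:
    """Compute kappa separately for each label column of two multi-hot matrices."""
    return {
        label: cohen_kappa([row[i] for row in ann_a], [row[i] for row in ann_b])
        for i, label in enumerate(LABELS)
    }
```

Computing kappa one label at a time treats each worry as an independent yes/no decision, which matches the per-label IAA values reported in the abstract.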

Results:

The numbers of blog posts per label were 477 for “treatment”, 1,138 for “physical”, 673 for “psychological”, 312 for “work/financial”, and 283 for “family/friends”. The IAA values were 0.67 for “treatment”, 0.76 for “physical”, 0.56 for “psychological”, 0.73 for “work/financial”, and 0.73 for “family/friends”, which indicated a high degree of agreement. By number of labels, 544 posts had no label, 892 had one label, and 836 had multiple labels. The worries varied by user, and the worries posted by the same user changed over time. The model performed well, although prediction performance differed by label. The precision values were 0.59 for “treatment”, 0.82 for “physical”, 0.64 for “psychological”, 0.67 for “work/financial”, and 0.58 for “family/friends”. The higher the IAA and the greater the number of posts, the higher the precision tended to be.
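To make the per-label precision figures concrete, here is a minimal sketch of how such values could be computed for a multi-label classifier and averaged across folds. It rests on assumptions the abstract does not state: that each label is decided independently by a sigmoid over its logit with a 0.5 threshold, and that precision is taken as 0.0 when a label is never predicted. All function names are hypothetical.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def predict_labels(logits: Sequence[float], threshold: float = 0.5) -> List[int]:
    """Turn one post's per-label logits into independent 0/1 decisions.

    Assumption: one sigmoid per label and a fixed threshold, so a post can
    receive zero, one, or several labels.
    """
    return [1 if sigmoid(z) >= threshold else 0 for z in logits]


def precision_per_label(
    y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]
) -> List[float]:
    """Precision = TP / (TP + FP), computed separately for each label column."""
    n_labels = len(y_true[0])
    scores = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if p[j] == 1 and t[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if p[j] == 1 and t[j] == 0)
        # Assumption: report 0.0 when the label was never predicted.
        scores.append(tp / (tp + fp) if (tp + fp) > 0 else 0.0)
    return scores


def mean_over_folds(fold_scores: Sequence[Sequence[float]]) -> List[float]:
    """Average per-label scores across cross-validation folds."""
    return [sum(col) / len(col) for col in zip(*fold_scores)]
```

With five folds, `mean_over_folds` would receive five lists of five precision values, one list per fold, and return one averaged value per label.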

Conclusions:

This study showed that the BERT model can extract multiple worries from breast cancer patient-generated text. This is the first application of a multi-label classifier using the BERT model to extract multiple worries from patient-generated text. This approach will be helpful in identifying breast cancer patients' worries and giving them timely social support.

Citation

Please cite as:

Watanabe T, Yada S, Aramaki E, Yajima H, Kizaki H, Hori S

Extracting Multiple Worries From Breast Cancer Patient Blogs Using Multilabel Classification With the Natural Language Processing Model Bidirectional Encoder Representations From Transformers: Infodemiology Study of Blogs

JMIR Cancer 2022;8(2):e37840

DOI: 10.2196/37840

PMID: 35657664

PMCID: 9206207