Accepted for/Published in: JMIR Cancer

Date Submitted: Mar 11, 2022
Date Accepted: May 23, 2022

The final, peer-reviewed published version of this preprint can be found here:

Extracting Multiple Worries From Breast Cancer Patient Blogs Using Multilabel Classification With the Natural Language Processing Model Bidirectional Encoder Representations From Transformers: Infodemiology Study of Blogs

Watanabe T, Yada S, Aramaki E, Yajima H, Kizaki H, Hori S

Extracting Multiple Worries From Breast Cancer Patient Blogs Using Multilabel Classification With the Natural Language Processing Model Bidirectional Encoder Representations From Transformers: Infodemiology Study of Blogs

JMIR Cancer 2022;8(2):e37840

DOI: 10.2196/37840

PMID: 35657664

PMCID: 9206207

Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.

Extracting Multiple Worries from Breast Cancer Patient Blogs Using Multi-Label Classification With BERT: a Natural Language-Processing Model

  • Tomomi Watanabe
  • Shuntaro Yada
  • Eiji Aramaki
  • Hiroshi Yajima
  • Hayato Kizaki
  • Satoko Hori

ABSTRACT

Background:

Breast cancer patients have a variety of worries and need multifaceted information support. Their accumulated posts on social media contain rich descriptions of daily worries about issues such as treatment, family, and finances. Identifying these worries is important for helping breast cancer patients resolve them and obtain reliable information.

Objective:

This study aimed to extract and classify multiple worries from breast cancer patient-generated text using bidirectional encoder representations from transformers (BERT), a context-aware natural language processing model.

Methods:

A total of 2,272 blog posts by breast cancer patients in Japan were collected. Five worry labels were defined: “treatment”, “physical”, “psychological”, “work/financial”, and “family/friends”. Multiple labels could be assigned to each post. To assess the labeling criteria, 50 blog posts were randomly selected and annotated by two researchers with medical knowledge. After inter-annotator agreement (IAA) was assessed with Cohen’s kappa, one researcher annotated all the blog posts. A multi-label classifier that simultaneously predicts the five worries in a text was developed using BERT. The classifier was fine-tuned by adding a classification layer to the pre-trained BERT and using the posts as input. Performance was evaluated as precision, averaged over the five-fold cross-validation results.
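As a rough illustration of the agreement check and the multi-label output described above, the following Python sketch computes Cohen's kappa for one label from two annotators' binary decisions and converts five independent scores into a set of labels. The sigmoid output, the 0.5 threshold, the convention for degenerate agreement, and all function names are assumptions made for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
import math

# The five worry labels defined in the study.
LABELS = ["treatment", "physical", "psychological", "work/financial", "family/friends"]


def cohen_kappa(a, b):
    """Cohen's kappa for two annotators' binary (0/1) decisions on one label."""
    if len(a) != len(b) or not a:
        raise ValueError("annotations must be non-empty and of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    pa1, pb1 = sum(a) / n, sum(b) / n
    # Agreement expected by chance, from each annotator's label frequencies.
    expected = pa1 * pb1 + (1 - pa1) * (1 - pb1)
    if expected == 1.0:
        return 1.0  # assumed convention: both annotators always chose the same class
    return (observed - expected) / (1 - expected)


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_labels(logits, threshold=0.5):
    """Treat each label as an independent yes/no decision (assumed setup).

    A post can receive zero, one, or several labels.
    """
    return [label for label, z in zip(LABELS, logits) if sigmoid(z) >= threshold]
```

For example, `predict_labels([2.0, -1.0, 0.3, -3.0, -0.2])` keeps only the labels whose sigmoid score reaches the threshold, and a post whose scores are all strongly negative receives no label, which matches the existence of unlabeled posts in the data set.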

Results:

The numbers of blog posts per label were 477 for “treatment”, 1,138 for “physical”, 673 for “psychological”, 312 for “work/financial”, and 283 for “family/friends”. The IAA values were 0.67 for “treatment”, 0.76 for “physical”, 0.56 for “psychological”, 0.73 for “work/financial”, and 0.73 for “family/friends”, indicating a high degree of agreement. By number of labels per post, 544 posts had no label, 892 had one label, and 836 had multiple labels. Worries varied among users, and the worries posted by the same user changed over time. The model performed well, although prediction performance differed by label. The precision values were 0.59 for “treatment”, 0.82 for “physical”, 0.64 for “psychological”, 0.67 for “work/financial”, and 0.58 for “family/friends”. Precision tended to be higher for labels with a higher IAA and a greater number of posts.
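The per-label precision values above can be read as TP / (TP + FP), computed separately for each label and averaged over the cross-validation folds. The sketch below shows that calculation on multi-hot vectors. The data layout, the function names, and the choice to return 0.0 when a label is never predicted are assumptions for illustration, not details taken from the study.

```python
def label_precision(gold, pred, label_index):
    """Precision for one label: TP / (TP + FP) over multi-hot label vectors."""
    tp = sum(1 for g, p in zip(gold, pred)
             if p[label_index] == 1 and g[label_index] == 1)
    fp = sum(1 for g, p in zip(gold, pred)
             if p[label_index] == 1 and g[label_index] == 0)
    if tp + fp == 0:
        return 0.0  # assumed convention when the label is never predicted
    return tp / (tp + fp)


def mean_precision_over_folds(folds, label_index):
    """Average one label's precision over (gold, pred) pairs, one pair per fold."""
    scores = [label_precision(gold, pred, label_index) for gold, pred in folds]
    return sum(scores) / len(scores)


# Consistency check on the reported counts: posts with no label, one label,
# and multiple labels add up to the 2,272 collected posts.
assert 544 + 892 + 836 == 2272
```

With five folds, `folds` would hold five `(gold, pred)` pairs, and calling `mean_precision_over_folds` once per label index would give one averaged precision per worry label.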

Conclusions:

This study showed that the BERT model can extract multiple worries from breast cancer patient-generated text. This is the first application of a multi-label classifier using the BERT model to extract multiple worries from patient-generated text. This approach will be helpful for identifying breast cancer patients’ worries and giving them timely social support.

