Extracting Multiple Worries from Breast Cancer Patient Blogs Using Multi-Label Classification with a Natural Language-Processing Model BERT (Bidirectional Encoder Representations from Transformers): Infodemiology Study of Blogs
ABSTRACT
Background:
Breast cancer patients have a variety of worries and need multifaceted information support. Their accumulated posts on social media contain rich descriptions of daily worries concerning issues such as treatment, family, and finances. Identifying these issues is important for helping breast cancer patients resolve their worries and obtain reliable information.
Objective:
This study aimed to extract and classify multiple worries from breast cancer patient-generated text using bidirectional encoder representations from transformers (BERT), a context-aware natural language processing model.
Methods:
A total of 2,272 blog posts by breast cancer patients in Japan were collected. Five worry labels were defined: “treatment”, “physical”, “psychological”, “work/financial”, and “family/friends”. Multiple labels could be assigned to each post. To assess the label criteria, 50 blog posts were randomly selected and annotated by two researchers with medical knowledge. After inter-annotator agreement (IAA) was assessed using Cohen’s kappa, one researcher annotated all the blog posts. A multi-label classifier that simultaneously predicts the five worries in a text was developed using BERT. The classifier was built by adding a classification layer to the pre-trained BERT and fine-tuning it with the posts as input. Performance was evaluated by precision, averaged over the five-fold cross-validation results.
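As a minimal sketch of the two measures named above, the following Python code computes Cohen’s kappa for one binary label and per-label precision for multi-label predictions. The function names, the sigmoid output with a 0.5 threshold, and the toy inputs are illustrative assumptions, not details reported by the study.

```python
import math

# The five worry labels defined in the study.
LABELS = ["treatment", "physical", "psychological",
          "work/financial", "family/friends"]


def cohen_kappa(a, b):
    """Cohen's kappa for two annotators' binary (0/1) judgements on the same posts."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a1 = sum(a) / n  # share of posts annotator A marked 1
    p_b1 = sum(b) / n  # share of posts annotator B marked 1
    # Agreement expected by chance if the annotators were independent.
    expected = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)
    if expected == 1.0:
        # Both annotators gave the same constant answer; treated here as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_labels(logits, threshold=0.5):
    """Turn five logits into five independent 0/1 decisions (assumed decision rule)."""
    return [1 if sigmoid(z) >= threshold else 0 for z in logits]


def per_label_precision(y_true, y_pred):
    """Precision for each label: true positives / all predicted positives."""
    result = {}
    for j, name in enumerate(LABELS):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        # Precision is undefined when the label was never predicted.
        result[name] = tp / (tp + fp) if (tp + fp) > 0 else None
    return result
```

Because each label gets its own independent decision, a single post can receive zero, one, or several labels, which matches the multi-label setting described above.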
Results:
The numbers of blog posts per label were 477 for “treatment”, 1,138 for “physical”, 673 for “psychological”, 312 for “work/financial”, and 283 for “family/friends”. The IAA values were 0.67 for “treatment”, 0.76 for “physical”, 0.56 for “psychological”, 0.73 for “work/financial”, and 0.73 for “family/friends”, indicating a high degree of agreement. There were 544 posts with no label, 892 posts with one label, and 836 posts with multiple labels. The worries varied by user, and the worries posted by the same user changed over time. The model performed well overall; however, prediction performance differed by label. The precision values were 0.59 for “treatment”, 0.82 for “physical”, 0.64 for “psychological”, 0.67 for “work/financial”, and 0.58 for “family/friends”. Precision tended to be higher for labels with higher IAA and a greater number of posts.
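The reported counts can be cross-checked with simple arithmetic. The sketch below uses only the figures quoted in this abstract; the variable names are hypothetical, and the derived quantities (total label assignments, mean labels per labelled post, macro-averaged precision) are illustrations rather than values reported by the study.

```python
# Figures quoted in the abstract.
total_posts = 2272
posts_by_label_count = {"none": 544, "one": 892, "multiple": 836}
label_counts = {
    "treatment": 477,
    "physical": 1138,
    "psychological": 673,
    "work/financial": 312,
    "family/friends": 283,
}
precision = {
    "treatment": 0.59,
    "physical": 0.82,
    "psychological": 0.64,
    "work/financial": 0.67,
    "family/friends": 0.58,
}

# The three groups of posts should partition the full data set.
posts_sum = sum(posts_by_label_count.values())

# Total label assignments exceed the number of posts because posts can carry several labels.
total_assignments = sum(label_counts.values())

# Derived illustration: mean number of labels among posts with at least one label.
labelled_posts = total_posts - posts_by_label_count["none"]
mean_labels_per_labelled_post = total_assignments / labelled_posts

# Derived illustration: unweighted (macro) average of the five precision values.
macro_precision = sum(precision.values()) / len(precision)

print(posts_sum == total_posts)
print(total_assignments, round(mean_labels_per_labelled_post, 2))
print(round(macro_precision, 2))
```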
Conclusions:
This study showed that the BERT model can extract multiple worries from breast cancer patient-generated text. This is the first application of a multi-label classifier using the BERT model to extract multiple worries from patient-generated text. This approach will be helpful for identifying breast cancer patients’ worries and providing them with timely social support.