Accepted for/Published in: JMIR Formative Research
Date Submitted: Nov 9, 2021
Date Accepted: Apr 21, 2022
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Pretrained Transformer Language Models vs Pretrained Word Embeddings for the Detection of Accurate Health Information on Arabic Social Media
ABSTRACT
Background:
In recent years, social media has become a major channel for health-related information in Saudi Arabia. While social media makes accurate health information easily accessible to many people, it has also become a channel for rapidly spreading health-related misinformation. Prior health informatics studies suggest that a large proportion of health-related posts on social media are inaccurate. Given the subject matter and the scale of this information, it is important to be able to automatically discriminate between accurate and inaccurate Arabic health-related posts.
Objective:
The first objective of this study is to generate a data set of generic Arabic health-related tweets, each labeled as containing either accurate or inaccurate health information. The second objective is to leverage this data set to train state-of-the-art deep learning models for detecting the accuracy of Arabic health-related tweets. In particular, this study aims to train and compare the performance of multiple deep learning models that use pretrained word embeddings and pretrained transformer language models.
Methods:
We used 900 health-related tweets from a previously published data set and applied a pretrained model to extract an additional 900 health-related tweets from a second data set collected specifically for this study. These 1800 tweets were labeled by two doctors as “accurate,” “inaccurate,” or “unsure.” The doctors agreed on 779 tweets, each labeled as either “accurate” or “inaccurate.” Nine variations of pretrained transformer language models were then trained and validated on 623 tweets (80% of the data set) and tested on 156 tweets (20% of the data set). For comparison, we also trained a bidirectional long short-term memory (BLSTM) model with seven different pretrained word embeddings as the input layer on the same data set. The models were compared in terms of their accuracy, precision, recall, F1 score, and the macro average of the F1 score.
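The evaluation metrics named above can be computed with scikit-learn. The following is an illustrative sketch only: the labels below are hypothetical and do not come from the study's data set, and the encoding of “accurate” as 1 and “inaccurate” as 0 is an assumption.

```python
# Hedged sketch: computing the study's reported metrics (accuracy, precision,
# recall, F1, and macro-averaged F1) with scikit-learn on toy labels.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical gold labels and model predictions: 1 = accurate, 0 = inaccurate.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)   # for the positive ("accurate") class
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")  # unweighted mean of per-class F1
```

The macro-averaged F1 gives both classes equal weight regardless of their frequency, which matters here because the labels are imbalanced (62% accurate vs 38% inaccurate).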
Results:
We constructed a data set of labeled tweets, of which 38% were labeled as inaccurate health information and 62% as accurate health information. Tweets on which at least one annotator was unsure were excluded. Of the deep learning models investigated, the AraBERTv0.2 Large model achieved the best overall accuracy (approximately 87.8%), with an F1 score of 87%.
Conclusions:
Our results indicate that the pretrained transformer language model AraBERTv0.2 performs best at classifying tweets as containing either accurate or inaccurate health information. Future studies should consider applying ensemble learning to combine the best-performing models, as this may yield better results.