Accepted for/Published in: JMIR Formative Research
Date Submitted: Nov 9, 2021
Date Accepted: Apr 21, 2022
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Pretrained Transformer Language Models vs Pretrained Word Embeddings for the Detection of Accurate Health Information on Arabic Social Media
ABSTRACT
Background:
In recent years, social media has become a major channel for health-related information in Saudi Arabia. While social media makes accurate health information easily accessible to many people, it has also become a channel for rapidly spreading health-related misinformation. Prior health informatics studies suggest that a large proportion of health-related posts on social media are inaccurate. Given the subject matter and the scale of this information, it is important to be able to automatically discriminate between accurate and inaccurate Arabic health-related posts.
Objective:
The first objective of this study is to generate a data set of generic Arabic health-related tweets, each labeled as containing either accurate or inaccurate health information. The second objective is to leverage this data set to train state-of-the-art deep learning models for detecting the accuracy of Arabic health-related tweets. In particular, this study aims to train and compare the performance of multiple deep learning models that use pretrained word embeddings and pretrained transformer language models.
Methods:
We used 900 health-related tweets from a previously published data set and applied a pretrained model to extract an additional 900 health-related tweets from a second data set collected specifically for this study. These 1800 tweets were labeled by two doctors as “accurate,” “inaccurate,” or “unsure.” The doctors agreed on 779 tweets, each labeled as either “accurate” or “inaccurate.” Nine variations of pretrained transformer language models were then trained and validated on 623 tweets (80% of the data set) and tested on 156 tweets (20% of the data set). For comparison, we also trained a bidirectional long short-term memory (BLSTM) model with seven different pretrained word embeddings as the input layer on the same data set. The models were compared in terms of their accuracy, precision, recall, F1 score, and the macro average of the F1 score.
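The evaluation metrics named above can be computed with scikit-learn. The following is an illustrative sketch only: the labels below are hypothetical and do not come from the study's data set, and the encoding of “accurate” as 1 and “inaccurate” as 0 is an assumption.

```python
# Hedged sketch: computing the study's reported metrics (accuracy, precision,
# recall, F1, and macro-averaged F1) with scikit-learn on toy labels.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hypothetical gold labels and model predictions: 1 = accurate, 0 = inaccurate.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)
precision = precision_score(y_true, y_pred)   # for the positive ("accurate") class
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")  # unweighted mean of per-class F1
```

The macro-averaged F1 gives both classes equal weight regardless of their frequency, which matters here because the labels are imbalanced (62% accurate vs 38% inaccurate).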
Results:
We constructed a data set of labeled tweets, of which 38% were labeled as inaccurate health information and 62% as accurate health information. Tweets on which at least one annotator was unsure were excluded. Of the deep learning models investigated, the AraBERTv0.2 Large model achieved the best overall accuracy (approximately 87.8%), with an F1 score of 87%.
Conclusions:
Our results indicate that the pretrained transformer language model AraBERTv0.2 performs best at classifying tweets as containing either accurate or inaccurate health information. Future studies should consider applying ensemble learning to combine the best-performing models, as this may yield better results.