Accepted for/Published in: JMIR Medical Informatics
Date Submitted: Feb 2, 2021
Date Accepted: Jun 20, 2021
Date Submitted to PubMed: Aug 4, 2021
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Adapting for Informal Language in Arabic Twitter Improves Monitoring of COVID-19 Pandemic and Influenza Epidemic
ABSTRACT
Background:
Twitter is a real-time messaging platform widely used by people and organisations to share information on many topics. Analysing tweets for infectious disease monitoring could reduce reporting lag and provide an independent, complementary source of data alongside traditional approaches. However, such analysis is currently not possible in the Arabic-speaking world because the basic building blocks for research are lacking.
Objective:
We collect around 4,000 Arabic tweets related to COVID-19 and Influenza. We clean and label the tweets relative to the Arabic Infectious Diseases Ontology, which includes non-standard terminology, 11 core concepts, and 21 relations. The aim of this study is to analyse Arabic tweets to estimate their usefulness for health surveillance, understand the impact of informal terms on the analysis, show the effect of deep learning methods on the classification process, and identify the locations where the infection is spreading.
Methods:
We apply multi-label classification techniques: Binary Relevance, Classifier Chains, Label Powerset, Adapted Algorithm (MLKNN), NBSVM, BERT, and AraBERT to identify infected people. We also use Named Entity Recognition to predict the locations affected.
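As a rough illustration of how two of the problem-transformation techniques named above reshape a multi-label task, the sketch below builds the training targets that Binary Relevance and Label Powerset would each hand to an ordinary single-label classifier. The label names, the toy label matrix, and the helper function names are hypothetical and are not taken from the study.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical label names, for illustration only (not the study's label set).
LABELS = ["infected", "symptom", "prevention"]


def binary_relevance_targets(
    label_matrix: Sequence[Sequence[int]], labels: Sequence[str]
) -> Dict[str, List[int]]:
    """Binary Relevance: one independent 0/1 target vector per label."""
    return {name: [row[j] for row in label_matrix] for j, name in enumerate(labels)}


def label_powerset_targets(
    label_matrix: Sequence[Sequence[int]],
) -> Tuple[List[int], Dict[Tuple[int, ...], int]]:
    """Label Powerset: each distinct label combination becomes one class id."""
    class_ids: Dict[Tuple[int, ...], int] = {}
    targets: List[int] = []
    for row in label_matrix:
        key = tuple(row)
        if key not in class_ids:
            class_ids[key] = len(class_ids)  # assign ids in order of first appearance
        targets.append(class_ids[key])
    return targets, class_ids


# Toy label matrix: one row per tweet, one column per label (assumed data).
Y = [
    [1, 1, 0],
    [0, 0, 1],
    [1, 1, 0],
    [0, 0, 0],
]

br_targets = binary_relevance_targets(Y, LABELS)
lp_targets, lp_classes = label_powerset_targets(Y)

print(br_targets)
print(lp_targets, lp_classes)
```

Binary Relevance trains one classifier per label and ignores label correlations, whereas Label Powerset captures co-occurring labels at the cost of one class per observed combination. Classifier Chains sit between the two by feeding earlier label predictions to later classifiers.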
Results:
We achieve an F1-score of up to 88% in the Influenza case study and 94% in the COVID-19 one. Adapting for non-standard terminology and informal language improves accuracy by as much as 15%, with an average improvement of 8%. Deep learning methods achieve a Hamming loss of around 5% in the classification process. Our geo-location detection algorithm predicts the location of users from tweet content with an average accuracy of 54%.
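For readers unfamiliar with the two metrics reported above, the following minimal sketch computes Hamming loss and an F1-score on a toy multi-label prediction. The data are invented for illustration, and micro-averaging is an assumption, since the abstract does not state which F1 averaging scheme the study uses.

```python
from typing import Sequence


def hamming_loss(y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]) -> float:
    """Fraction of individual label slots that are predicted incorrectly."""
    total = 0
    wrong = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            total += 1
            wrong += int(t != p)
    return wrong / total


def micro_f1(y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]) -> float:
    """Micro-averaged F1: pool true/false positives and false negatives over all labels."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Toy example (assumed data): 4 tweets, 3 labels each.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 1, 1]]

loss = hamming_loss(y_true, y_pred)
f1 = micro_f1(y_true, y_pred)
print(f"Hamming loss: {loss:.3f}, micro-F1: {f1:.3f}")
```

Lower Hamming loss is better (0 means every label slot is correct), while higher F1 is better, so the two figures are read in opposite directions.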
Conclusions:
This study identifies two Arabic social media datasets for monitoring tweets related to Influenza and COVID-19. It demonstrates the importance of including informal terms, which are regularly used by social media users, in the analysis. It also shows that BERT achieves good results when used with new terms in COVID-19 tweets. Finally, tweet content may contain useful information for determining the location of the disease spread.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.