Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Nov 18, 2022
Date Accepted: Mar 14, 2023
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Tweeting for Health: An Infodemics Data Ecosystem for Real-Time Mining and AI-Based Analytics for Twitter
ABSTRACT
Background:
Digital misinformation, primarily on social media, has led to harmful and costly beliefs in the general population. Notably, these beliefs have resulted in public health crises to the detriment of governments around the world and their citizens. However, public health officials currently lack access to a comprehensive system capable of mining and analyzing large volumes of social media data in real time.
Objective:
The aim of this study was to design and develop a big data pipeline and ecosystem, the UbiLab Infodemics Analysis System (U-IAS), for identifying and analyzing false information disseminated via social media on a given topic or set of related topics.
Methods:
U-IAS is a platform-independent ecosystem developed in Python that leverages the Twitter V2 API and the Elastic Stack. The U-IAS expert system has 5 major components: (a) Data Extraction Framework; (b) Latent Dirichlet Allocation (LDA) Topic Model; (c) Sentiment Analyzer; (d) Information Disorder Classification Model; and (e) Elastic Cloud Deployment (indexing of data and visualizations). The Data Extraction Framework retrieves data through the Twitter V2 API, using queries identified by public health experts. The LDA Topic Model, Sentiment Analyzer, and Information Disorder Classification Model are trained independently on a small, expert-validated subset of the extracted data. These models are then incorporated into U-IAS to analyze and classify the remaining data. Finally, the analyzed data are loaded into an index in the Elastic Cloud deployment, where they can be presented in dashboards with advanced visualizations and analytics pertinent to infodemics analysis.
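The flow described above (query, analyze, index) can be sketched roughly as follows. This is a minimal illustration and not the U-IAS implementation: the function names, the index name `infodemics-tweets`, the example query, and the placeholder models are all hypothetical. It assumes the Twitter API v2 recent search endpoint and the newline-delimited JSON format accepted by the Elasticsearch bulk API, and it only builds the request URL and payload without sending anything over the network.

```python
import json
from urllib.parse import urlencode

# Twitter API v2 recent search endpoint (assumed to be the one used for extraction).
SEARCH_URL = "https://api.twitter.com/2/tweets/search/recent"


def build_search_url(query, max_results=100):
    """Build a recent-search request URL for an expert-defined query."""
    params = {
        "query": query,
        "max_results": max_results,
        "tweet.fields": "created_at,lang",
    }
    return SEARCH_URL + "?" + urlencode(params)


def analyze(tweet, topic_model, sentiment_model, disorder_model):
    """Attach the output of each (hypothetical) trained model to one tweet."""
    text = tweet["text"]
    return {
        **tweet,
        "topic": topic_model(text),
        "sentiment": sentiment_model(text),
        "information_disorder": disorder_model(text),
    }


def to_bulk_ndjson(docs, index_name):
    """Serialize documents as action/source line pairs for the bulk API."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc["id"]}}))
        lines.append(json.dumps(doc))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


# Placeholder models standing in for the trained LDA, sentiment, and
# information disorder classifiers (assumptions for illustration only).
topic_model = lambda text: "vaccines" if "vaccine" in text.lower() else "other"
sentiment_model = lambda text: 0.0
disorder_model = lambda text: "unclassified"

url = build_search_url("vaccine lang:en")
tweets = [{"id": "1", "text": "Vaccine rollout begins today"}]
analyzed = [analyze(t, topic_model, sentiment_model, disorder_model) for t in tweets]
payload = to_bulk_ndjson(analyzed, "infodemics-tweets")
```

In a real deployment the URL would be requested with a bearer token and the payload posted to the cluster's bulk endpoint; both steps are omitted here so the sketch stays self-contained.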
Results:
Each component of the system performed as expected. The data extraction framework handled large volumes of data within short periods of time. The LDA topic models achieved relatively high coherence values (0.54), and the predicted topics fit the data well. The sentiment analyzer achieved a correlation coefficient of 0.61, which could be improved in further iterations. The information disorder classifier attained a satisfactory correlation coefficient of 0.76 against the expert-validated data. Moreover, the Elastic Cloud deployment stored data efficiently and offered comprehensive visualization and analytics capabilities. Investigators have successfully used the system to extract insights relevant to public health.
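The abstract does not state which correlation coefficient was used to compare model output with the expert-validated labels. As an illustration only, the sketch below assumes binary labels and the Matthews correlation coefficient, a common choice for evaluating classifiers; the function name and the example labels are hypothetical and are not taken from the study.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary labels coded as 0 and 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0.0 when any marginal count is zero.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Made-up example: expert labels versus classifier predictions.
expert_labels = [1, 1, 0, 0]
predicted_labels = [1, 0, 0, 0]
score = matthews_corrcoef(expert_labels, predicted_labels)
```

A value of 1 indicates perfect agreement with the expert labels, 0 indicates agreement no better than chance, and -1 indicates complete disagreement.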
Conclusions:
The novel U-IAS pipeline has the potential to detect and analyze misleading information related to a particular topic or set of related topics. Future work can focus on integrating social media data from multiple sources into dashboards for multiplatform analysis and on testing the ecosystem on other public health use cases.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.