Accepted for/Published in: JMIR Infodemiology
Date Submitted: Apr 19, 2022
Open Peer Review Period: Apr 19, 2022 - Jun 14, 2022
Date Accepted: Sep 10, 2022
Data Exploration and Classification of News Article Reliability: A Deep Learning Study
ABSTRACT
Background:
During the COVID-19 pandemic, we are exposed to large amounts of information each day. This “infodemic” is defined by the World Health Organization as the mass spread of misleading or false information during a pandemic. The spread of misinformation during the infodemic ultimately leads to misunderstandings of public health orders or direct opposition to public policies. Although there have been efforts to combat the spread of misinformation, current manual fact-checking methods are insufficient to combat the infodemic.
Objective:
We propose the use of natural language processing (NLP) and machine learning (ML) techniques to build a model that can be used to identify unreliable articles online.
Methods:
We first preprocessed the ReCOVery dataset to obtain 2029 English news articles tagged with COVID-19 keywords from January to May 2020, each labelled as reliable or unreliable. We then explored the data to determine the major differences between reliable and unreliable articles. Finally, we built an ensemble deep learning model that classifies reliability using the body text together with features such as sentiment, Empath-derived lexical categories, and readability.
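As an illustration of what a readability feature might look like, the sketch below computes the Flesch Reading Ease score for a piece of text. This is an assumption made for illustration only: the abstract does not state which readability measure was used, and the function names and the vowel-group syllable heuristic are hypothetical, not the authors' implementation.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate: count runs of consecutive vowels.

    This heuristic is an assumption of the sketch; it miscounts some
    words (for example, those ending in a silent 'e').
    """
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: higher scores indicate easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


if __name__ == "__main__":
    simple = "The cat sat on the mat."
    dense = ("Epidemiological investigations necessitate "
             "comprehensive methodological considerations.")
    print(flesch_reading_ease(simple))
    print(flesch_reading_ease(dense))
```

A score like this could be appended to an article's feature vector alongside sentiment and lexical-category counts before classification.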
Results:
We found that reliable news articles had a higher proportion of neutral sentiment, while unreliable articles had a higher proportion of negative sentiment. Additionally, our analysis showed that reliable articles are easier to read than unreliable articles and use different lexical categories and keywords. Our new model achieved an area under the curve (AUC) of 0.906, a specificity of 0.835, and a sensitivity of 0.945, which is above the baseline performance of the original ReCOVery model.
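For readers less familiar with the reported metrics, the following minimal sketch shows how sensitivity, specificity, and AUC can be computed from binary labels and model scores. The toy labels, the scores, and the choice of label 1 as the positive class are assumptions for illustration; the abstract does not state which class was treated as positive or how the metrics were computed.

```python
def sensitivity_specificity(y_true, y_pred, positive=1):
    """Return (sensitivity, specificity) for binary labels.

    Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).
    """
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true, scores, positive=1):
    """AUC as the probability that a random positive example
    receives a higher score than a random negative one (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present in y_true")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Hypothetical toy data, not results from the study.
    y_true = [1, 1, 1, 0, 0]
    y_pred = [1, 1, 0, 0, 1]
    scores = [0.9, 0.8, 0.4, 0.3, 0.6]
    print(sensitivity_specificity(y_true, y_pred))
    print(auc(y_true, scores))
```

The pairwise AUC computation is quadratic in the number of examples, which is fine for a dataset of a few thousand articles; a rank-based formulation would be the usual choice at larger scale.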
Conclusions:
This paper identifies novel differences between reliable and unreliable articles; moreover, the model was trained using state-of-the-art deep learning techniques. We aim to use our findings to help researchers and the public more easily identify false information and unreliable media in their everyday lives.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.