JMIR Preprints #38839: Data Exploration and Classification of News Article Reliability: A Deep Learning Study


Data Exploration and Classification of News Article Reliability: A Deep Learning Study

Kevin Zhan;
Yutong Li;
Rafay Osmani;
Xiaoyu Wang;
Bo Cao

ABSTRACT

Background:

During the COVID-19 pandemic, we are exposed to large amounts of information each day. The World Health Organization defines this “infodemic” as the mass spread of misleading or false information during a pandemic. The spread of misinformation during the infodemic ultimately leads to misunderstanding of public health orders or direct opposition to public policies. Although there have been efforts to combat the spread of misinformation, current manual fact-checking methods are insufficient to combat the infodemic.

Objective:

We propose the use of natural language processing (NLP) and machine learning (ML) techniques to build a model that can be used to identify unreliable articles online.

Methods:

We first preprocessed the ReCOVery dataset to obtain 2029 English news articles tagged with COVID-19 keywords from January to May 2020, each labelled as reliable or unreliable. We then explored the data to determine the major differences between reliable and unreliable articles. Finally, we built an ensemble deep learning model that classifies article reliability using the body text together with features such as sentiment, Empath-derived lexical categories, and readability.
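
To make the readability feature concrete, the following is a minimal sketch of one common readability score, the Flesch Reading Ease. The abstract does not state which readability measure was used, so the choice of this formula, the function names, and the rough vowel-group syllable heuristic are all assumptions for illustration rather than the authors' implementation. Higher scores indicate text that is easier to read.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate (assumption): count groups of consecutive vowels."""
    lowered = word.lower()
    count = len(re.findall(r"[aeiouy]+", lowered))
    # Treat a trailing silent "e" as non-syllabic (e.g. "made"), but keep "-le" (e.g. "table").
    if lowered.endswith("e") and not lowered.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: 206.835 - 1.015 * (words/sentences) - 84.6 * (syllables/words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


if __name__ == "__main__":
    easy = "The cat sat. The dog ran."
    hard = ("Epidemiological misinformation proliferates internationally "
            "notwithstanding institutional countermeasures.")
    print(flesch_reading_ease(easy), flesch_reading_ease(hard))
```

A score like this could be appended to the other per-article features (sentiment, lexical categories) before classification; how the authors actually combined their features is not described in the abstract.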

Results:

We found that reliable news articles had a higher proportion of neutral sentiment, while unreliable articles had a higher proportion of negative sentiment. Our analysis also showed that reliable articles are easier to read than unreliable articles and that the two groups differ in lexical categories and keywords. On evaluation, our new model achieved an AUC of 0.906, a specificity of 0.835, and a sensitivity of 0.945, which exceeds the baseline performance of the original ReCOVery model.
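
For readers less familiar with the reported metrics, the sketch below shows how sensitivity, specificity, and AUC can be computed from labels and model outputs. It assumes that AUC means the area under the ROC curve and that label 1 marks the positive class; the abstract does not say which class the authors treated as positive, and the function names and toy data are hypothetical.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative example")
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """ROC AUC as the probability that a random positive outscores a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Count pairwise wins; ties contribute half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy data for illustration only; not taken from the study.
    y_true = [1, 1, 1, 0, 0]
    y_pred = [1, 1, 0, 0, 1]
    scores = [0.9, 0.8, 0.3, 0.2, 0.6]
    print(sensitivity_specificity(y_true, y_pred), roc_auc(y_true, scores))
```

Sensitivity and specificity depend on the decision threshold applied to the model's scores, whereas AUC summarizes ranking quality across all thresholds, which is why the three values are reported together.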

Conclusions:

This paper identifies novel differences between reliable and unreliable articles; moreover, the model was trained using state-of-the-art deep learning techniques. We aim to use our findings to help researchers and the public more easily identify false information and unreliable media in their everyday lives.

Citation

Please cite as:

Zhan K, Li Y, Osmani R, Wang X, Cao B

Data Exploration and Classification of News Article Reliability: Deep Learning Study

JMIR Infodemiology 2022;2(2):e38839

DOI: 10.2196/38839

PMID: 36193330

PMCID: 9516811

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

JMIR Publications

JMIR Preprints

Accepted for/Published in: JMIR Infodemiology

Date Submitted: Apr 19, 2022

Open Peer Review Period: Apr 19, 2022 - Jun 14, 2022

Date Accepted: Sep 10, 2022

