Accepted for/Published in: JMIR Infodemiology

Date Submitted: Apr 14, 2022
Date Accepted: Aug 8, 2022

The final, peer-reviewed published version of this preprint can be found here:

COVID-19 Misinformation Detection: Machine-Learned Solutions to the Infodemic

Kolluri NL, Liu Y, Murthy D

COVID-19 Misinformation Detection: Machine-Learned Solutions to the Infodemic

JMIR Infodemiology 2022;2(2):e38756

DOI: 10.2196/38756

PMID: 37113446

PMCID: 9987189

COVID-19 Misinformation Detection: Machine Learned Solutions to the Infodemic

  • Nikhil Leland Kolluri
  • Yunong Liu
  • Dhiraj Murthy

ABSTRACT

Background:

The volume of Coronavirus Disease 2019 (COVID-19)-related misinformation has long exceeded the resources available to fact checkers to effectively mitigate its ill effects. Automated and web-based approaches can provide effective deterrents to online misinformation. Machine learning-based methods have achieved robust performance on text classification tasks, including credibility assessment of potentially low-quality news. Despite the progress of initial, rapid interventions, the enormity of COVID-19-related misinformation continues to overwhelm fact checkers. Therefore, improvement in automated and machine-learned methods for infodemic response is urgently needed.

Objective:

The aim of this study was to improve automated and machine-learned methods for infodemic response.

Methods:

We evaluated three strategies for training a machine learning model to determine which yields the highest model performance: (1) COVID-19-related fact-checked data only, (2) general fact-checked data only, and (3) combined COVID-19 and general fact-checked data. We created two COVID-19-related misinformation datasets from fact-checked ‘false’ content combined with programmatically retrieved ‘true’ content. The first set contained ~7,000 entries from July to August 2020, and the second contained ~31,000 entries from January 2020 to June 2022. We crowdsourced 31,441 votes to human-label the first dataset.
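The three training strategies can be illustrated with a minimal sketch. It only shows how the three training sets differ in composition; the function name, the label encoding, and the placeholder examples are hypothetical and are not taken from the study, which does not describe its data pipeline in this abstract.

```python
from typing import Dict, List, Tuple

# (text, label) pair. Assumed encoding: 1 = fact-checked 'false', 0 = 'true'.
Example = Tuple[str, int]


def build_training_sets(
    covid: List[Example], general: List[Example]
) -> Dict[str, List[Example]]:
    """Return one training set per strategy named in the Methods."""
    return {
        "covid_only": list(covid),                # strategy (1)
        "general_only": list(general),            # strategy (2)
        "combined": list(covid) + list(general),  # strategy (3)
    }


# Hypothetical placeholder data, for illustration only.
covid_examples: List[Example] = [
    ("Placeholder COVID-19 claim A", 1),
    ("Placeholder COVID-19 report B", 0),
]
general_examples: List[Example] = [
    ("Placeholder general claim C", 1),
    ("Placeholder general report D", 0),
    ("Placeholder general claim E", 1),
]

training_sets = build_training_sets(covid_examples, general_examples)
for name, examples in training_sets.items():
    print(name, len(examples))
```

Each of the three sets would then be used to train or fine-tune the same model architecture, so that differences in validation accuracy can be attributed to the training data rather than to the model.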

Results:

The models achieved accuracies of 96.55% and 94.56% on the first and second external validation datasets, respectively. Our best-performing model was developed using COVID-19-specific content. We successfully developed combined models that outperformed human votes on misinformation. Specifically, when we blended our model predictions with human votes, the highest accuracy we achieved on the first external validation dataset was 99.1%. When we considered only outputs where the machine learning model agreed with human votes, we achieved accuracies of up to 98.59% on the first validation dataset. This outperformed human votes alone, which had an accuracy of just 73%.
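The two ways of combining model output with human votes can be sketched as follows. The abstract does not specify how predictions and votes were blended, so the weighted average, the 0.5 threshold, the function names, and the toy numbers below are all assumptions made for illustration, not the authors' method.

```python
from typing import List, Tuple


def blend(model_prob: float, vote_fraction: float, weight: float = 0.5) -> int:
    """Assumed blending rule: weighted average of the model's probability of
    'false' and the fraction of human votes for 'false', thresholded at 0.5."""
    score = weight * model_prob + (1.0 - weight) * vote_fraction
    return 1 if score >= 0.5 else 0


def agreement_subset_accuracy(
    model_labels: List[int], human_labels: List[int], truth: List[int]
) -> Tuple[float, int]:
    """Accuracy on the items where model and human labels agree, plus the
    number of items kept."""
    kept = [(m, t) for m, h, t in zip(model_labels, human_labels, truth) if m == h]
    if not kept:
        return float("nan"), 0
    correct = sum(1 for m, t in kept if m == t)
    return correct / len(kept), len(kept)


# Hypothetical toy data: 1 = 'false' (misinformation), 0 = 'true'.
model_probs = [0.9, 0.2, 0.6, 0.4]
vote_fractions = [0.8, 0.1, 0.3, 0.7]
truth = [1, 0, 0, 1]

blended = [blend(p, v) for p, v in zip(model_probs, vote_fractions)]
model_labels = [1 if p >= 0.5 else 0 for p in model_probs]
human_labels = [1 if v >= 0.5 else 0 for v in vote_fractions]
subset_accuracy, subset_size = agreement_subset_accuracy(
    model_labels, human_labels, truth
)

print("blended labels:", blended)
print("agreement subset accuracy:", subset_accuracy, "on", subset_size, "items")
```

In this toy example the model and the human votes each make one error on different items, so blending recovers both and the agreement subset contains only correctly labeled items. Restricting to the agreement subset trades coverage for accuracy, since items where the two sources disagree are left unlabeled.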

Conclusions:

External validation accuracies of 96.55% and 94.56% are evidence that machine learning can produce superior results for the difficult task of classifying the veracity of COVID-19 content. Pre-trained language models (PLMs) performed best when fine-tuned on a topic-specific dataset, while other models achieved their best accuracy when fine-tuned on a combination of topic-specific and general-topic datasets. Crucially, our study found that blended models, trained or fine-tuned on general-topic content with crowdsourced data, improved our models' accuracies up to 99.7%. The successful use of crowdsourced data can increase the accuracy of models in situations where expert-labeled data is scarce. The 98.59% accuracy on a “high-confidence” subsection comprising machine-learned and human labels suggests that crowdsourced votes can optimize machine-learned labels to improve accuracy above human-only levels. These results support the utility of supervised machine learning to deter and combat future health-related disinformation.




© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.