JMIR Preprints #99795: Linguistic Markers of COVID-19 Misinformation on X/Twitter: Machine Learning Analysis of Fake News, Non-Fake News, and Official Public Health Communication


Peter Yee-Lap To;
Brian Chun

ABSTRACT

Background:

Health misinformation on social media remains a public health concern because false or misleading content can spread rapidly during crises and may undermine adherence to evidence-based guidance. Prior work suggests that emotionally salient, interesting, and cognitively accessible content is more likely to attract engagement; however, less is known about whether interpretable linguistic features can distinguish COVID-19 fake news from official public health communication and from non-fake news.

Objective:

This study examined whether affective, lexical, and discourse-level linguistic features can classify COVID-19 fake news on X/Twitter and identify which features provide the strongest single-feature discrimination.

Methods:

This observational computational study combined COVID-19 tweets from official public health and institutional accounts with publicly available fake news and non-fake news datasets. Official communications were collected from 12 English-language accounts between December 2019 and December 2022 and filtered using COVID-19 keyword criteria. Fake news and non-fake news tweets were drawn from the COVID Rumor, CONSTRAINT/AAAI, CoAID, and TruthSeeker datasets. The final analytic corpus included 25,181 official communication tweets, 22,424 fake news tweets, and 5,415 non-fake news tweets. Features included sentiment, valence, arousal, dominance, word frequency ranks, and GisPy-derived discourse features such as referential cohesion, semantic and WordNet verb overlap, imageability, concreteness, and hypernymy. Two binary classification tasks were conducted: fake news versus official communication and fake news versus non-fake news. Random oversampling was used to balance classes. Single-feature decision stumps were used for interpretable feature ranking, and decision tree, AdaBoost, and Light Gradient Boosting Machine classifiers were fitted using all features.
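The single-feature ranking step described above can be illustrated with a minimal sketch. It is not the authors' code: it assumes binary labels coded 0/1, uses a plain-Python threshold search in place of whatever library the study used, and scores each stump on the same data it was fitted on for brevity. The function names (`random_oversample`, `fit_stump`, `rank_features`), the feature names, and the toy values are all hypothetical and are not drawn from the study corpus.

```python
import random


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority-class rows until classes are balanced."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for row in rows + extra:
            X_out.append(row)
            y_out.append(label)
    return X_out, y_out


def fit_stump(values, labels):
    """Find the best one-threshold rule on a single feature (labels are 0/1).

    Returns (accuracy, threshold, label_above). A threshold of None means no
    split beat the majority-class baseline.
    """
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    total_pos = sum(label for _, label in pairs)
    majority = 1 if total_pos * 2 >= n else 0
    best = (max(total_pos, n - total_pos) / n, None, majority)
    pos_left = 0
    for i in range(n - 1):
        pos_left += pairs[i][1]
        if pairs[i][0] == pairs[i + 1][0]:
            continue  # no valid threshold between equal values
        threshold = (pairs[i][0] + pairs[i + 1][0]) / 2
        n_left = i + 1
        # Rule "predict 1 above the threshold": negatives on the left and
        # positives on the right are classified correctly.
        correct_above_1 = (n_left - pos_left) + (total_pos - pos_left)
        for correct, above in ((correct_above_1, 1), (n - correct_above_1, 0)):
            accuracy = correct / n
            if accuracy > best[0]:
                best = (accuracy, threshold, above)
    return best


def rank_features(X, y, names):
    """Fit one stump per feature column and sort features by stump accuracy."""
    results = []
    for j, name in enumerate(names):
        column = [row[j] for row in X]
        accuracy, threshold, above = fit_stump(column, y)
        results.append((name, accuracy, threshold, above))
    return sorted(results, key=lambda item: item[1], reverse=True)


# Toy data (hypothetical): column 0 separates the classes, column 1 does not.
X_toy = [[0.9, 1], [0.8, 0], [0.7, 1], [0.85, 0], [0.2, 1], [0.1, 0]]
y_toy = [0, 0, 0, 0, 1, 1]

X_bal, y_bal = random_oversample(X_toy, y_toy, seed=1)
ranking = rank_features(X_bal, y_bal, ["sentiment", "noise"])
for name, accuracy, threshold, above in ranking:
    print(name, round(accuracy, 4), threshold, above)
```

In practice the oversampling would more safely be applied to the training split only, with stump accuracy reported on held-out data, so that duplicated rows cannot appear on both sides of the split.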

Results:

For fake news versus official communication, the strongest single feature was transformer-based sentiment (accuracy=0.7699), followed by number of paragraphs (accuracy=0.7481), referential cohesion (accuracy=0.7251), and number of sentences (accuracy=0.6915). Using all features, testing accuracy was 0.8703 for the decision tree, 0.9186 for AdaBoost, and 0.9583 for Light Gradient Boosting Machine. For fake news versus non-fake news, single-feature accuracies were lower; the strongest features were median content-word rank (accuracy=0.6722) and arousal (accuracy=0.6432). Using all features, testing accuracy was 0.7379 for the decision tree, 0.8258 for AdaBoost, and 0.9655 for Light Gradient Boosting Machine.

Conclusions:

Interpretable linguistic features distinguished COVID-19 fake news from official public health communication and, to a lesser extent at the single-feature level, from non-fake news. Sentiment, text segmentation, referential cohesion, word rank, and arousal appear particularly relevant. The findings support further evaluation of transparent linguistic markers for infodemiology and public health communication research, but external validation, temporal evaluation, and models that include account-level and engagement-level features are needed before operational deployment.

Citation

Please cite as:

To PYL, Chun B

Linguistic Markers of COVID-19 Misinformation on X/Twitter: Machine Learning Analysis of Fake News, Non-Fake News, and Official Public Health Communication

JMIR Preprints. 29/04/2026:99795

DOI: 10.2196/preprints.99795

URL: https://preprints.jmir.org/preprint/99795

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

Currently submitted to: JMIR Infodemiology

Date Submitted: Apr 29, 2026

Open Peer Review Period: May 12, 2026 - Jul 7, 2026

NOTE: This is an unreviewed Preprint