Accepted for/Published in: JMIR Infodemiology
Date Submitted: Mar 30, 2022
Date Accepted: Nov 30, 2022
Warning: This is an author submission that is not peer-reviewed or edited. Preprints - unless they show as "accepted" - should not be relied on to guide clinical practice or health-related behavior and should not be reported in news media as established information.
Detecting Tweets Containing Cannabidiol-Related COVID-19 Misinformation Using Transformer Language Models and FDA Warning Letters
ABSTRACT
Background:
The COVID-19 pandemic gave online sellers of loosely regulated substances such as cannabidiol (CBD) yet another medical condition to invoke in false promotional claims. As a result, new methods are needed to identify such instances of misinformation.
Objective:
We aimed to use transformer-based language models to identify COVID-19 misinformation related to the sale or promotion of CBD by finding tweets that are semantically similar to quotes taken from known instances of misinformation, specifically the publicly available FDA warning letters.
Methods:
We collected tweets using CBD- and COVID-19-related terms. Using a previously trained model, we extracted the tweets indicating commercialization or sale of CBD and annotated those containing COVID-19 misinformation according to the FDA’s definitions. We encoded the collection of tweets and the misinformation quotes into sentence vectors and then calculated the cosine similarity between each quote and each tweet, so that a threshold could be established to identify tweets making false claims regarding CBD and COVID-19 while minimizing the number of false positives.
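The quote-to-tweet comparison described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not name the encoder, so the sentence vectors are assumed to be precomputed, and the toy 3-dimensional arrays, the function names, and the 0.8 threshold are all hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(quote_vecs, tweet_vecs):
    """Return an (n_quotes, n_tweets) array of cosine similarities.

    Assumes every vector is non-zero.
    """
    quotes = np.asarray(quote_vecs, dtype=float)
    tweets = np.asarray(tweet_vecs, dtype=float)
    # L2-normalize each row so that a dot product equals cosine similarity.
    quotes = quotes / np.linalg.norm(quotes, axis=1, keepdims=True)
    tweets = tweets / np.linalg.norm(tweets, axis=1, keepdims=True)
    return quotes @ tweets.T

def flag_tweets(quote_vecs, tweet_vecs, threshold):
    """Flag a tweet when its best match against any quote reaches the threshold."""
    sims = cosine_similarity_matrix(quote_vecs, tweet_vecs)
    best = sims.max(axis=0)  # highest similarity to any quote, per tweet
    return best >= threshold, best

# Toy stand-ins for sentence vectors (hypothetical values, not real embeddings).
quote_vecs = np.array([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0]])
tweet_vecs = np.array([[2.0, 0.1, 0.0],   # close to quote 0
                       [0.0, 0.0, 3.0],   # unrelated to both quotes
                       [0.1, 1.0, 0.1]])  # close to quote 1

flags, best = flag_tweets(quote_vecs, tweet_vecs, threshold=0.8)
```

Taking the maximum over quotes means a tweet is flagged if it resembles at least one known offending quote; other aggregation rules (for example, a mean over quotes) are possible and would change which tweets are flagged.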
Results:
We demonstrated that quotes taken from FDA warning letters for known offenses can be used to identify semantically similar tweets that contain similar misinformation. This was done by identifying a cosine similarity threshold between the sentence vectors of the warning letter quotes and the sentence vectors of the tweets.
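One way a threshold of this kind could be chosen from annotated tweets is sketched below. The abstract does not state the selection rule, so the precision target, the function name `pick_threshold`, and the example scores and labels are assumptions made purely for illustration.

```python
import numpy as np

def pick_threshold(scores, labels, min_precision=0.9):
    """Return the lowest score threshold whose precision meets the target.

    scores: best quote similarity per annotated tweet.
    labels: True where the tweet was annotated as misinformation.
    Returns None when no candidate threshold reaches min_precision.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    # np.unique returns candidates in ascending order, so the first hit is
    # the lowest qualifying threshold (and therefore flags the most tweets).
    for thr in np.unique(scores):
        flagged = scores >= thr  # never empty: thr is one of the scores
        precision = np.sum(flagged & labels) / flagged.sum()
        if precision >= min_precision:
            return float(thr)
    return None

# Hypothetical annotated sample: similarity scores and misinformation labels.
scores = [0.95, 0.90, 0.85, 0.60, 0.40]
labels = [True, True, False, True, False]

strict = pick_threshold(scores, labels, min_precision=1.0)
lenient = pick_threshold(scores, labels, min_precision=0.7)
```

A stricter precision target pushes the threshold up and reduces false positives at the cost of missing some true cases, which is the trade-off the Methods describe.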
Conclusions:
Our framework shows that commercial CBD/COVID-19 misinformation can potentially be identified, and consequently curbed, by using transformer-based language models and known prior instances of misinformation. Our approach functions without the need for labeled data, potentially reducing the time needed to identify misinformation. Our proposed framework shows promise in being easily adapted to identify other forms of misinformation related to loosely regulated substances, such as claims related to autism, dementia, and Alzheimer’s disease.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.