Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Mar 13, 2023
Date Accepted: Aug 31, 2023
Identifying potential Lyme disease cases using self-reported worldwide tweets: A deep learning modelling approach enhanced with sentimental words through emojis.
ABSTRACT
Background:
Lyme disease is the most prevalent tick-borne disease in the Northern Hemisphere. Delayed treatment can exacerbate symptoms and result in more severe cases, making this condition a major public health concern in the coming years. Additionally, the Lyme disease surveillance system relies on healthcare professionals to report cases, which limits the accuracy of its data because only cases in which patients seek medical attention are reported. Thus, there is a need to enhance Lyme disease surveillance tools with other data sources, such as web data.
Objective:
Worldwide Twitter data was analyzed to understand its potential and its limitations as a tool for Lyme disease surveillance. The proposed Twitter data system is primarily a transformer-based classifier that leverages self-reported tweets to identify potential cases of Lyme disease.
Methods:
We used approximately 20,000 English tweets collected worldwide, drawn from a database of more than 1.3 million tweets related to Lyme disease. Because most Lyme disease tweets originate in the US, we limited the selection to 20,000 tweets, of which about 10% came from countries other than the US, to capture more variability across countries. After preprocessing and geolocating the tweets, a set of carefully selected keywords was used to manually label a subset of tweets as potential Lyme disease cases or non-Lyme disease cases. Emojis were converted to sentiment words, which then replaced the emojis in the tweets. The labelled dataset was then used to train, validate, and test three transformer-based classifier variants, namely ALBERT, DistilBERT, and BERTweet, to classify the remaining tweets and other new tweets.
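The emoji-to-sentiment-word step described above can be illustrated with a minimal sketch. The lexicon `EMOJI_SENTIMENT` and the function `replace_emojis` are hypothetical names introduced only for illustration; the mapping actually used in the study is not given in this abstract, so the three entries below are assumptions.

```python
# Hypothetical emoji-to-sentiment lexicon (an assumption for illustration;
# the study's actual mapping is not stated in the abstract).
EMOJI_SENTIMENT = {
    "😢": "sad",
    "🙏": "hopeful",
    "💪": "strong",
}


def replace_emojis(text, lexicon=None):
    """Replace each known emoji in `text` with its sentiment word."""
    if lexicon is None:
        lexicon = EMOJI_SENTIMENT
    for emoji, word in lexicon.items():
        # Pad with spaces so the word does not fuse with neighbouring tokens.
        text = text.replace(emoji, f" {word} ")
    # Collapse the extra whitespace introduced by the padding.
    return " ".join(text.split())


print(replace_emojis("Diagnosed with Lyme last week 😢🙏"))
```

Under these assumptions, the enriched text, rather than the raw tweet, would be what is passed to the classifier's tokenizer. Emojis missing from the lexicon are left unchanged.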
Results:
The empirical results showed that BERTweet was the best classifier among all models evaluated, with the highest average F1-score (89.3%), classification accuracy (90.0%), and precision (97.1%). The exception was recall, where TF-IDF with k-Nearest Neighbors performed better (93.2% versus 82.6% for BERTweet). When emoji expressions were used to enrich the tweet embeddings, the recall score for BERTweet increased by 8%, DistilBERT reached a markedly higher F1-score of 93.8% (+4%) and a classification accuracy of 94.1% (+4%), and ALBERT reached an F1-score of 93.1% (+5%) and a classification accuracy of 93.9% (+5%).
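As a quick consistency check on the figures above, the F1-score is the harmonic mean of precision and recall. The sketch below applies that standard definition to the precision and recall reported for BERTweet. The function name `f1_from_precision_recall` is introduced here for illustration. Because the reported F1-score is an average, agreement with this single computation is indicative rather than exact.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported for BERTweet without emoji enrichment.
f1 = f1_from_precision_recall(0.971, 0.826)
print(f"F1 = {f1 * 100:.1f}%")  # prints "F1 = 89.3%"
```

The result agrees with the reported average F1-score of 89.3% and shows how the lower recall pulls F1 well below the 97.1% precision.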
Conclusions:
This study revealed several key findings. First, BERTweet and DistilBERT can serve as robust NLP classifiers to identify self-reported potential cases of Lyme disease. Second, emojis are effective as enrichment features that improve the accuracy of the tweet embeddings and the performance of transformer-based classifiers. In particular, emojis reflecting sadness, empathy, and encouragement can help reduce false negatives. Third, general awareness of Lyme disease is high in the United States, the United Kingdom, Australia, and Canada, as self-reported potential cases of Lyme disease on Twitter from these countries accounted for more than 50% of the collected English tweets, while Lyme disease-related tweets are scarce in countries in Africa and Asia. Finally, the most commonly reported symptoms of Lyme disease are rash, fatigue, fever, and arthritis, while symptoms such as borrelial lymphocytoma, palpitations, swollen lymph nodes, neck stiffness, and irregular heartbeat are unusual and rare.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a cc-by license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.