Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Mar 13, 2023
Date Accepted: Aug 31, 2023

The final, peer-reviewed published version of this preprint can be found here:

Identifying Potential Lyme Disease Cases Using Self-Reported Worldwide Tweets: Deep Learning Modeling Approach Enhanced With Sentimental Words Through Emojis

Laison EKE, Hamza Ibrahim M, Boligarla S, Li J, Mahadevan R, Ng A, Muthuramalingam V, Lee JWY, Yin Y, Nasri B

J Med Internet Res 2023;25:e47014

DOI: 10.2196/47014

PMID: 37843893

PMCID: 10616745

Identifying potential Lyme disease cases using self-reported worldwide tweets: A deep learning modelling approach enhanced with sentimental words through emojis.

  • Elda K. E. Laison; 
  • Mohamed Hamza Ibrahim; 
  • Srikanth Boligarla; 
  • Jiaxin Li; 
  • Raja Mahadevan; 
  • Austen Ng; 
  • Venkataraman Muthuramalingam; 
  • Jack Wee Yi Lee; 
  • Yijun Yin; 
  • Bouchra Nasri

ABSTRACT

Background:

Lyme disease is the most prevalent tick-borne disease in the Northern Hemisphere. Delayed treatment can exacerbate symptoms and lead to more severe cases, making this condition a major public health concern for the coming years. In addition, the Lyme disease surveillance system relies on health care professionals to report cases, which limits the accuracy of its data because only cases in which the patient seeks medical attention are reported. There is therefore a need to enhance Lyme disease surveillance tools with other data sources, such as web data.

Objective:

Worldwide Twitter data was analyzed to understand its potential and its limitations as a tool for Lyme disease surveillance. The proposed Twitter data system is primarily a transformer-based classifier that leverages self-reported tweets to identify potential cases of Lyme disease.

Methods:

We first used approximately 20,000 English tweets collected worldwide from a database of more than 1.3 million tweets related to Lyme disease. Because most Lyme disease tweets are from the US, we selected only 20,000 tweets, of which about 10% came from countries other than the US, to capture more variability across countries. After preprocessing and geolocating the tweets, a set of carefully selected keywords was used to manually label a subset of tweets as potential Lyme disease cases or non-Lyme disease cases. Emojis were converted to sentiment words, which then replaced the emojis in the tweets. The dataset of labelled tweets was then used to train, validate, and test the performance of three transformer-based classifier variants, namely ALBERT, DistilBERT, and BERTweet, to classify the remaining tweets and other new tweets.
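The emoji substitution step can be illustrated with a minimal sketch. The lookup table below is hypothetical: the abstract does not list the emoji-to-word mapping used in the study, so the three entries and the function name `replace_emojis` are assumptions chosen only to show the idea of swapping each emoji for a sentiment word before the text reaches the classifier.

```python
# Hypothetical emoji-to-sentiment-word table (assumption: the study's
# actual mapping is not given in the abstract).
EMOJI_TO_WORD = {
    "😢": "sad",
    "🙏": "hopeful",
    "💪": "strong",
}


def replace_emojis(text: str, mapping: dict = EMOJI_TO_WORD) -> str:
    """Replace each known emoji in `text` with its sentiment word."""
    # Longer keys first, so multi-codepoint emojis are matched before
    # any shorter emoji that happens to be a prefix of them.
    for emoji in sorted(mapping, key=len, reverse=True):
        text = text.replace(emoji, f" {mapping[emoji]} ")
    # Collapse the extra whitespace introduced by the padding above.
    return " ".join(text.split())


tweet = "Diagnosed with Lyme last week 😢 staying positive💪"
cleaned = replace_emojis(tweet)
print(cleaned)
```

Padding each replacement with spaces keeps the sentiment word from fusing with a neighbouring token when the emoji was typed directly after a word.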

Results:

The empirical results showed that BERTweet was the best classifier among all classification models evaluated, with the highest average F1-score (89.3%), classification accuracy (90.0%), and precision (97.1%). The exception was recall, where TF-IDF and k-Nearest Neighbors performed better (93.2% vs 82.6% for BERTweet). When emoji expressions were used to enrich the tweet embeddings, the recall score for BERTweet increased by 8%, DistilBERT reached a markedly higher F1-score of 93.8% (+4%) and a classification accuracy of 94.1% (+4%), and ALBERT reached an F1-score of 93.1% (+5%) and a classification accuracy of 93.9% (+5%).
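As a consistency check on these figures, the F1-score is the harmonic mean of precision and recall. The sketch below assumes the precision and recall reported for BERTweet (before emoji enrichment) can be combined directly. The abstract reports an average F1-score, so values for individual runs may differ. The function name is illustrative.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0  # conventional value when both inputs are zero
    return 2 * precision * recall / (precision + recall)


# BERTweet precision and recall as reported in the Results above.
bertweet_f1 = f1_from_precision_recall(precision=0.971, recall=0.826)
print(f"{bertweet_f1:.1%}")  # prints 89.3%
```

The result agrees with the reported average F1-score of 89.3%, which suggests that the high precision and the lower recall are the two quantities the F1-score is balancing.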

Conclusions:

This study revealed several key findings. First, BERTweet and DistilBERT can serve as robust NLP classifiers to identify self-reported potential cases of Lyme disease. Second, emojis are effective as enrichment features to improve the accuracy of the tweet embeddings and the performance of transformer-based classifiers. In particular, emojis reflecting sadness, empathy, and encouragement can help reduce false negatives. Third, general awareness of Lyme disease is high in the United States, the United Kingdom, Australia, and Canada: self-reported potential cases of Lyme disease on Twitter from these countries accounted for more than 50% of the collected English tweets, whereas Lyme disease-related tweets are scarce in countries in Africa and Asia. Finally, the most commonly reported symptoms of Lyme disease are rash, fatigue, fever, and arthritis, whereas symptoms such as borrelial lymphocytoma, palpitations, swollen lymph nodes, neck stiffness, and irregular heartbeat are rarely reported.


© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.