Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Dec 18, 2020
Date Accepted: Apr 3, 2021

The final, peer-reviewed published version of this preprint can be found here:

Developing an Automatic System for Classifying Chatter About Health Services on Twitter: Case Study for Medicaid

Yang YC, Al-Garadi MA, Bremer W, Zhu JM, Grande D, Sarker A

Developing an Automatic System for Classifying Chatter About Health Services on Twitter: Case Study for Medicaid

J Med Internet Res 2021;23(5):e26616

DOI: 10.2196/26616

PMID: 33938807

PMCID: 8129876

Developing an automatic system for classifying chatter about health services from Twitter: A case study for Medicaid

  • Yuan-Chi Yang
  • Mohammed Ali Al-Garadi
  • Whitney Bremer
  • Jane M. Zhu
  • David Grande
  • Abeed Sarker

ABSTRACT

Background:

The wide adoption of social media in daily life renders it a rich and effective resource for conducting close-to-real-time assessments of consumers’ perceptions about health services. This is, however, challenging due to the vast amount of data and the diverse content in the social media chatter.

Objective:

To develop and evaluate an automatic system, based on natural language processing and machine learning, for characterizing user-posted Twitter data about health services, using Medicaid, the single largest insurance program in the United States, as an example.

Methods:

We collected data from Twitter in two ways: (i) via the public streaming API using Medicaid-related keywords (Corpus-1), and (ii) by using the website’s search option for tweets mentioning the agency-specific handles (Corpus-2). We manually labeled a sample of tweets as belonging to one of five predetermined categories or to "other," and artificially increased the number of training posts from specific low-frequency categories. Using the manually labeled data, we trained and evaluated several supervised learning algorithms, including Support Vector Machine, Random Forest (RF), Naïve Bayes, shallow Neural Network (NN), k-Nearest Neighbor, Bi-Directional Long Short-Term Memory, and Bidirectional Encoder Representations from Transformers (BERT). We then applied the best-performing classifier to the collected tweets for post-classification analyses assessing the utility of our methods.
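The abstract does not say how the number of training posts in low-frequency categories was increased. As an illustration only, the following minimal sketch assumes simple random oversampling by duplication; the function name `oversample`, the `min_count` threshold, the label strings, and the toy tweets are all hypothetical and are not taken from the study.

```python
import random
from collections import Counter


def oversample(examples, min_count, seed=0):
    """Return a new list in which every label has at least min_count examples.

    examples is a list of (text, label) pairs. Labels that fall short are
    topped up by duplicating randomly chosen examples of that label.
    This is an assumed technique, not necessarily the one used in the study.
    """
    rng = random.Random(seed)
    by_label = {}
    for text, label in examples:
        by_label.setdefault(label, []).append((text, label))

    result = list(examples)
    for label, items in by_label.items():
        shortfall = min_count - len(items)
        if shortfall > 0:
            result.extend(rng.choice(items) for _ in range(shortfall))
    return result


# Hypothetical toy training set; labels are illustrative placeholders.
train = [
    ("medicaid expansion vote today", "political"),
    ("senate debates medicaid cuts", "political"),
    ("governor signs medicaid bill", "political"),
    ("my medicaid card never arrived", "consumer_feedback"),
    ("what time is it", "other"),
]

balanced = oversample(train, min_count=3)
label_counts = Counter(label for _, label in balanced)
```

Oversampling of this kind should be applied to the training split only, so that validation and test sets keep the original category distribution.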

Results:

We manually annotated 11,379 tweets (Corpus-1: 9,179; Corpus-2: 2,200), using 7,930 (69.7%) for training, 1,449 (12.7%) for validation, and 2,000 (17.6%) for testing. A BERT-based classifier obtained the highest accuracies (81.7%, Corpus-1; 80.7%, Corpus-2) and the highest F1-scores on Consumer Feedback (0.58, Corpus-1; 0.90, Corpus-2), outperforming the second-best classifiers in accuracy (74.6%, RF on Corpus-1; 69.4%, RF on Corpus-2) and in F1-score on Consumer Feedback (0.44, NN on Corpus-1; 0.82, RF on Corpus-2). Post-classification analyses revealed differing inter-corpora distributions of tweet categories, with political (64%) and consumer-feedback (55%) tweets being the most frequent for Corpus-1 and Corpus-2, respectively.
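For readers unfamiliar with the two metrics reported above, the following minimal sketch shows how overall accuracy and a per-category F1-score can be computed from gold and predicted labels. The helper names and the toy label lists are hypothetical and do not reproduce any data from the study.

```python
def accuracy(y_true, y_pred):
    """Fraction of items whose predicted label equals the gold label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def f1_for_class(y_true, y_pred, target):
    """F1-score (harmonic mean of precision and recall) for one category."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == target and p == target for t, p in pairs)
    fp = sum(t != target and p == target for t, p in pairs)
    fn = sum(t == target and p != target for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold and predicted labels for six tweets.
gold = ["feedback", "feedback", "political", "political", "other", "feedback"]
pred = ["feedback", "political", "political", "political", "other", "feedback"]

acc = accuracy(gold, pred)
f1_feedback = f1_for_class(gold, pred, "feedback")
```

Reporting a per-category F1-score alongside accuracy matters when categories are imbalanced, because a classifier can reach high accuracy while performing poorly on a low-frequency category.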

Conclusions:

The broad and variable content of Medicaid-related tweets necessitates automatic categorization to identify topic-relevant posts. Our proposed system presents a feasible solution for automatic categorization, and can be deployed/generalized for health service programs other than Medicaid. Annotated data and methods are available for future studies (https://yyang60@bitbucket.org/sarkerlab/medicaid-classification-script-and-data-for-public).


© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.