Accepted for/Published in: JMIR Public Health and Surveillance
Date Submitted: Nov 17, 2020
Date Accepted: Feb 9, 2021
Predicting Age Groups of Reddit Users based on Posting Behavior and Metadata: Comparative Study of Classification Models
ABSTRACT
Background:
Social media is an important medium for monitoring perceptions of public health issues and for educating target audiences about health. However, limited information about the demographics of social media users makes it challenging to identify conversations among target audiences, which limits the application of social media insights for public health surveillance and education outreach efforts. Certain social media platforms provide demographic information about the followers of a user account when users supply it, but this information is not always disclosed. Researchers have developed machine learning algorithms to predict demographic characteristics of social media users (e.g., age, gender), but mainly for Twitter. To date, there has been limited research on predicting demographic characteristics of users on Reddit.
Objective:
This study aimed to develop a machine learning algorithm that predicts the age segment of Reddit users, as either youth or adults, based on publicly available data.
Methods:
We manually labeled Reddit users’ age by identifying and reviewing public posts in which Reddit users self-reported their age. We then collected sample posts, comments, and metadata for the labeled user accounts and created variables to capture linguistic patterns, posting behavior, and account details that would distinguish the youth age group (aged 13 to 20) from the adult age group (aged 21 to 54). We split the data into training and test sets and performed 5-fold cross-validation on the training set to select hyperparameters and perform feature selection. We ran multiple classification algorithms and evaluated how accurately each model predicted the age segments in the labeled data, using precision, recall, and F1 score. To evaluate associations between each feature and the outcome, we calculated means and confidence intervals for each transformed model feature and compared the two age groups using two-sample t-tests.
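The per-feature comparison between age groups can be illustrated with a minimal sketch. The abstract does not say which variant of the two-sample t-test was used, so the unequal-variance (Welch) form here is an assumption, and the feature values are made up purely for illustration.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return (t statistic, approximate degrees of freedom) for Welch's t-test."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a = statistics.mean(sample_a)
    mean_b = statistics.mean(sample_b)
    # Squared standard error of each group mean (sample variance / n).
    se2_a = statistics.variance(sample_a) / n_a
    se2_b = statistics.variance(sample_b) / n_b
    t_stat = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t_stat, df


# Hypothetical values of one transformed feature (not data from the study).
youth_values = [1.0, 2.0, 3.0, 4.0, 5.0]
adult_values = [2.0, 4.0, 6.0, 8.0, 10.0]

t_stat, df = welch_t(youth_values, adult_values)
print(f"t = {t_stat:.3f}, df = {df:.2f}")
```

A negative t statistic here means the first group has the lower mean for that feature; a p-value would additionally require the t distribution, which is available in libraries such as SciPy.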
Results:
The gradient boosted trees classifier performed the best, with an overall F1 score of 0.80. The precision and recall scores were 0.81 and 0.88, respectively, for the 13–20 age group and 0.78 and 0.66, respectively, for the 21–54 age group. The most important feature in the model was the number of sentences per comment (permutation score mean: 0.100, std: 0.004). When compared with the 21–54 age group, members of the younger 13–20 age group tended to have created an account more recently, have a higher proportion of submissions and comments in the r/teenagers subreddit, and post more in subreddits with higher subscriber counts.
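As a worked check on how these metrics relate, the F1 score is the harmonic mean of precision and recall. The sketch below applies that formula to the rounded per-group values reported above; the resulting per-group F1 values are therefore approximate and are not figures reported in the abstract. The abstract also does not state how the overall F1 score of 0.80 was averaged across the two groups, so that step is not reproduced here.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded precision and recall values taken from the Results above.
f1_youth = f1_score(0.81, 0.88)  # 13-20 age group
f1_adult = f1_score(0.78, 0.66)  # 21-54 age group

print(f"13-20 age group F1: {f1_youth:.3f}")
print(f"21-54 age group F1: {f1_adult:.3f}")
```

The lower recall for the 21–54 age group pulls its F1 score below that of the 13–20 age group, which is consistent with the overall score falling between the two.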
Conclusions:
We created a Reddit age prediction algorithm with competitive accuracy using publicly available data, suggesting machine learning methods can help public health agencies identify age-related target audiences on Reddit. Our results also suggest that there are characteristics of Reddit users’ posting behavior, linguistic patterns, and account features that distinguish youth from adults.