Accepted for/Published in: Journal of Medical Internet Research
Date Submitted: Apr 5, 2023
Date Accepted: Aug 1, 2023
Methods and Annotated Datasets Used to Predict Gender and Age of Twitter Users: A Scoping Review
ABSTRACT
Background:
Real-world data (RWD) has been identified as a key information source in health and social science research. An important and readily available source of RWD is social media. Identifying the gender and age of the authors of social media posts is necessary for assessing the representativeness of the sample by these key demographics and enables researchers to study subgroups and disparities. However, deciphering the age and gender of social media users can be challenging. We present a scoping review of the literature and summarize the automated methods used to predict the age and gender of Twitter users.
Objective:
To summarize the methods used to predict or infer the age and gender of Twitter users from 2017 onward.
Methods:
We undertook a scoping review of the literature to summarize the methods used to extract the age and gender of Twitter users. We searched 15 electronic databases and Google Scholar, and carried out reference checking to retrieve studies that met our inclusion criteria. Screening was undertaken independently by two researchers.
Results:
From 684 records, 74 met our inclusion criteria. Of these, 42 focused on predicting gender only, 8 predicted age only, and 24 predicted both. The heterogeneous nature of the studies and the lack of consistent performance measures made any synthesis of the results difficult.
Conclusions:
We found that although methods to extract age and gender evolved over time to utilize deep neural networks, many still relied on more traditional machine learning methods. Gender prediction has achieved higher reported performance, while age prediction lags behind, particularly for more granular age groups. However, the heterogeneous nature of the studies and the lack of consistent performance measures made it impossible to quantitatively synthesize results. We found evidence that data bias is a prevalent problem and discuss suggestions for minimizing it in future studies.
Copyright
© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.