JMIR Preprints #47923: Methods and Annotated Datasets Used to Predict Gender and Age of Twitter Users: A Scoping Review

Karen O'Connor;
Su Golder;
Davy Weissenbacher;
Ari Z Klein;
Arjun Magge;
Graciela Gonzalez-Hernandez

ABSTRACT

Background:

Real-world data (RWD) has been identified as a key information source in health and social science research. An important and readily available source of RWD is social media. Identifying the gender and age of the authors of social media posts is necessary for assessing the representativeness of the sample by these key demographics, and it enables researchers to study subgroups and disparities. However, deciphering the age and gender of social media users can be challenging. We present a scoping review of the literature and summarize the automated methods used to predict the age and gender of Twitter users.

Objective:

To summarize methods used to predict or infer the age and gender of Twitter users from 2017 onward.

Methods:

We undertook a scoping review of the literature to summarize the methods used to extract the age and gender of Twitter users. We searched 15 electronic databases and Google Scholar, and carried out reference checking to retrieve studies that met our inclusion criteria. Screening was undertaken independently by two researchers.

Results:

From 684 records, 74 met our inclusion criteria. Of these, 42 focused on predicting gender only, 8 on predicting age only, and 24 on predicting both. The heterogeneous nature of the studies and the lack of consistent performance measures made any synthesis of the results difficult.

Conclusions:

We found that although methods to extract age and gender evolved over time to utilize deep neural networks, many still relied on more traditional machine learning methods. Gender prediction has achieved higher reported performance, while age prediction lags, particularly for more granular age groups. However, the heterogeneous nature of the studies and the lack of consistent performance measures made it impossible to quantitatively synthesize results. We found evidence that data bias is a prevalent problem and discuss suggestions to minimize it in future studies.

Citation

Please cite as:

O'Connor K, Golder S, Weissenbacher D, Klein AZ, Magge A, Gonzalez-Hernandez G

Methods and Annotated Data Sets Used to Predict the Gender and Age of Twitter Users: Scoping Review

J Med Internet Res 2024;26:e47923

DOI: 10.2196/47923

PMID: 38488839

PMCID: 10980991

Copyright

© The authors. All rights reserved. This is a privileged document currently under peer review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.

Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Apr 5, 2023

Date Accepted: Aug 1, 2023
