
Accepted for/Published in: Journal of Medical Internet Research

Date Submitted: Apr 5, 2023
Date Accepted: Aug 1, 2023

The final, peer-reviewed published version of this preprint can be found here:

Methods and Annotated Data Sets Used to Predict the Gender and Age of Twitter Users: Scoping Review

O'Connor K, Golder S, Weissenbacher D, Klein AZ, Magge A, Gonzalez-Hernandez G

J Med Internet Res 2024;26:e47923

DOI: 10.2196/47923

PMID: 38488839

PMCID: 10980991

Methods and Annotated Datasets Used to Predict Gender and Age of Twitter Users: A Scoping Review

  • Karen O'Connor
  • Su Golder
  • Davy Weissenbacher
  • Ari Z Klein
  • Arjun Magge
  • Graciela Gonzalez-Hernandez

ABSTRACT

Background:

Real-world data (RWD) has been identified as a key information source in health and social science research. An important and readily available source of RWD is social media. Identifying the gender and age of the authors of social media posts is necessary for assessing the representativeness of a sample with respect to these key demographics, and it enables researchers to study subgroups and disparities. However, determining the age and gender of social media users can be challenging. We present a scoping review of the literature and summarize the automated methods used to predict the age and gender of Twitter users.

Objective:

To summarize the methods used to predict or infer the age and gender of Twitter users from 2017 onward.

Methods:

We undertook a scoping review of the literature to summarize the methods used to extract the age and gender of Twitter users. We searched 15 electronic databases and Google Scholar, and carried out reference checking, to retrieve studies that met our inclusion criteria. Screening was undertaken independently by two researchers.

Results:

Of 684 records, 74 met our inclusion criteria. Of these, 42 focused on predicting gender only, 8 predicted age only, and 24 predicted both. The heterogeneous nature of the studies and the lack of consistent performance measures made any synthesis of the results difficult.

Conclusions:

We found that although methods to extract age and gender evolved over time to utilize deep neural networks, many studies still relied on more traditional machine learning methods. Gender prediction has achieved higher reported performance, while age prediction lags, particularly for more granular age groups. However, the heterogeneous nature of the studies and the lack of consistent performance measures made it impossible to quantitatively synthesize the results. We found evidence that data bias is a prevalent problem and discuss suggestions for minimizing it in future studies.



© The authors. All rights reserved. This is a privileged document currently under peer-review/community review (or an accepted/rejected manuscript). Authors have provided JMIR Publications with an exclusive license to publish this preprint on its website for review and ahead-of-print citation purposes only. While the final peer-reviewed paper may be licensed under a CC-BY license on publication, at this stage authors and publisher expressly prohibit redistribution of this draft paper other than for review purposes.