Accepted for/Published in: JMIR Research Protocols

Date Submitted: Dec 29, 2019
Date Accepted: Jun 13, 2020

The final, peer-reviewed published version of this preprint can be found here:

Using Machine Learning to Optimize the Quality of Survey Data: Protocol for a Use Case in India

Shah N, Mohan D, Bashingwa JJH, Ummer O, Chakraborty A, LeFevre AE

JMIR Res Protoc 2020;9(8):e17619

DOI: 10.2196/17619

PMID: 32755886

PMCID: 7439143

Using Machine Learning to optimize the quality of survey data: protocol for a use case in India

  • Neha Shah
  • Diwakar Mohan
  • Jean Juste Harisson Bashingwa
  • Osama Ummer
  • Arpita Chakraborty
  • Amnesty E. LeFevre

ABSTRACT

Background:

Data quality is vital for ensuring the accuracy, reliability, and validity of survey findings. Strategies for ensuring survey data quality have traditionally relied on quality assurance (QA) procedures. Data analytics are an increasingly vital part of survey QA, particularly in light of the growing use of tablets and other electronic tools, which enable rapid, if not real-time, data access. Routine data analytics are most often concerned with outlier analyses that monitor a series of data quality indicators, including response rates, missing data, and reliability coefficients for test-retest interviews. Machine learning (ML) is emerging as a possible tool for enhancing real-time data monitoring by identifying trends in data collection that could compromise quality.
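As a rough illustration of the routine outlier analyses described above, the sketch below flags enumerators whose missing-data rate sits far from the group median. It is not taken from the protocol: the enumerator IDs, the counts, the robust z-score rule (based on the median absolute deviation), and the 3.5 cutoff are all assumptions chosen for the example.

```python
from statistics import median

def flag_outliers(rates, threshold=3.5):
    """Return IDs whose rate is an outlier by a robust z-score.

    The score uses the median absolute deviation (MAD); the constant
    0.6745 rescales it to be comparable to a standard z-score when
    the data are roughly normal. The threshold is an assumed cutoff.
    """
    values = list(rates.values())
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread around the median: this rule cannot flag anyone
    return [eid for eid, v in rates.items()
            if abs(0.6745 * (v - med) / mad) > threshold]

# Hypothetical (missing items, total items) per enumerator.
counts = {
    "E01": (1, 10), "E02": (2, 10), "E03": (0, 10), "E04": (9, 10),
    "E05": (1, 10), "E06": (2, 10), "E07": (3, 10),
}
rates = {eid: missing / total for eid, (missing, total) in counts.items()}
flagged = flag_outliers(rates)
print(flagged)
```

The same function could be applied to any per-enumerator indicator, such as response rates, provided the indicator is expressed as one number per enumerator.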

Objective:

To describe methods for the quality assessment of a household survey using both traditional methods and machine learning analytics.

Methods:

In the Kilkari impact evaluation’s end-line survey among postpartum women (n=5,095) in Madhya Pradesh, India, we plan to use both traditional and ML QA procedures to improve the quality of survey data captured on maternal and child health knowledge, care-seeking, and practices. The QA strategy aims to identify biases and other impediments to data quality and includes six main components: (1) tool development; (2) enumerator recruitment and training; (3) field coordination; (4) field monitoring; (5) data analytics; and (6) feedback loops for decision-making. Data analytics will include basic descriptive analysis as well as outlier analyses using ML algorithms. The ML analyses will involve creating features from timestamps, “don’t know” rates, and skip rates; obtaining labeled data from self-filled surveys; and building models using k-fold cross-validation on a training set of the data with both supervised and unsupervised learning algorithms. Results from these models will be fed back to the field through various feedback loops.
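The analytics steps named above (features from timestamps, "don't know" rates, and skip rates, followed by k-fold cross-validation on labeled data) can be sketched as follows. This is only an illustration under stated assumptions: the record layout, the `"DK"` and `None` response codes, the synthetic interviews, and the nearest-centroid classifier are all hypothetical stand-ins, since the abstract does not name specific algorithms or data formats.

```python
import random
from datetime import datetime, timedelta

DONT_KNOW, SKIPPED = "DK", None  # hypothetical response codes

def make_features(interview):
    """Map one interview to [duration_minutes, dont_know_rate, skip_rate]."""
    start = datetime.fromisoformat(interview["start"])
    end = datetime.fromisoformat(interview["end"])
    answers = interview["answers"]
    n = len(answers)
    return [
        (end - start).total_seconds() / 60,
        sum(a == DONT_KNOW for a in answers) / n,
        sum(a is SKIPPED for a in answers) / n,
    ]

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k near-equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_nearest_centroid(X, y):
    """Stand-in supervised model: the mean feature vector of each label."""
    model = {}
    for label in set(y):
        rows = [x for x, l in zip(X, y) if l == label]
        model[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return model

def predict(model, x):
    """Return the label whose centroid is closest (squared Euclidean)."""
    return min(model, key=lambda lab: sum((a - b) ** 2
                                          for a, b in zip(x, model[lab])))

def cross_validate(X, y, k=5):
    """Mean held-out accuracy across k folds."""
    scores = []
    for fold in k_fold_indices(len(X), k):
        held_out = set(fold)
        train = [i for i in range(len(X)) if i not in held_out]
        model = fit_nearest_centroid([X[i] for i in train],
                                     [y[i] for i in train])
        hits = sum(predict(model, X[i]) == y[i] for i in fold)
        scores.append(hits / len(fold))
    return sum(scores) / k

def synthetic_interview(minutes, n_dk, n_skip, n_items=20):
    """Build a made-up interview record for demonstration only."""
    start = datetime(2020, 1, 15, 9, 0)
    answers = [DONT_KNOW] * n_dk + [SKIPPED] * n_skip
    answers += ["yes"] * (n_items - n_dk - n_skip)
    return {"start": start.isoformat(),
            "end": (start + timedelta(minutes=minutes)).isoformat(),
            "answers": answers}

# Label 0: field interviews; label 1: self-filled surveys (both synthetic).
genuine = [synthetic_interview(35 + i, 1, 2) for i in range(10)]
self_filled = [synthetic_interview(8 + i % 5, 6, 0) for i in range(10)]
X = [make_features(iv) for iv in genuine + self_filled]
y = [0] * 10 + [1] * 10
accuracy = cross_validate(X, y, k=5)
print(accuracy)
```

Because the synthetic groups are deliberately well separated, the accuracy here says nothing about real survey data. In practice the features would likely need rescaling, since duration in minutes otherwise dominates the distance calculation.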

Results:

Data collection began in late October 2019 and will continue through March 2020. We expect to submit QA results by May 2020.

Conclusions:

ML is underutilized as a tool to improve survey data quality in low-resource settings. Study findings are anticipated to improve the overall quality of Kilkari survey data and, in turn, enhance the robustness of the impact evaluation. More broadly, the proposed QA approach has implications for data capture applications used for special surveys as well as for the routine collection of health information by health workers.
