
Using Machine Learning to Optimize the Quality of Survey Data: Protocol for a Use Case in India

BACKGROUND: Data quality is vital for ensuring the accuracy, reliability, and validity of survey findings. Strategies for ensuring survey data quality have traditionally used quality assurance procedures. Data analytics is an increasingly vital part of survey quality assurance, particularly in light...


Bibliographic Details
Main Authors:	Shah, Neha; Mohan, Diwakar; Bashingwa, Jean Juste Harisson; Ummer, Osama; Chakraborty, Arpita; LeFevre, Amnesty E
Format:	Online Article Text
Language:	English
Published:	JMIR Publications 2020
Subjects:	Protocol
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7439143/ https://www.ncbi.nlm.nih.gov/pubmed/32755886 http://dx.doi.org/10.2196/17619

_version_	1783572922289356800
author	Shah, Neha; Mohan, Diwakar; Bashingwa, Jean Juste Harisson; Ummer, Osama; Chakraborty, Arpita; LeFevre, Amnesty E
author_facet	Shah, Neha; Mohan, Diwakar; Bashingwa, Jean Juste Harisson; Ummer, Osama; Chakraborty, Arpita; LeFevre, Amnesty E
author_sort	Shah, Neha
collection	PubMed
description	BACKGROUND: Data quality is vital for ensuring the accuracy, reliability, and validity of survey findings. Strategies for ensuring survey data quality have traditionally relied on quality assurance procedures. Data analytics is an increasingly vital part of survey quality assurance, particularly in light of the increasing use of tablets and other electronic tools, which enable rapid, if not real-time, data access. Routine data analytics are most often concerned with outlier analyses that monitor a series of data quality indicators, including response rates, missing data, and reliability coefficients for test-retest interviews. Machine learning is emerging as a possible tool for enhancing real-time data monitoring by identifying trends in data collection that could compromise quality. OBJECTIVE: This study aimed to describe methods for the quality assessment of a household survey using both traditional methods and machine learning analytics. METHODS: In the Kilkari impact evaluation’s end-line survey among postpartum women (n=5095) in Madhya Pradesh, India, we plan to use both traditional and machine learning–based quality assurance procedures to improve the quality of survey data captured on maternal and child health knowledge, care-seeking, and practices. The quality assurance strategy aims to identify biases and other impediments to data quality and includes seven main components: (1) tool development, (2) enumerator recruitment and training, (3) field coordination, (4) field monitoring, (5) data analytics, (6) feedback loops for decision making, and (7) outcomes assessment. Analyses will include basic descriptive and outlier analyses using machine learning algorithms, which will involve creating features from time stamps, “don’t know” rates, and skip rates. We will also obtain labeled data from self-filled surveys and build models using k-fold cross-validation on a training data set with both supervised and unsupervised learning algorithms. Based on these models, results will be fed back to the field through various feedback loops. RESULTS: Data collection began in late October 2019 and will continue through March 2020. We expect to submit quality assurance results by August 2020. CONCLUSIONS: Machine learning is underutilized as a tool to improve survey data quality in low-resource settings. Study findings are anticipated to improve the overall quality of Kilkari survey data and, in turn, enhance the robustness of the impact evaluation. More broadly, the proposed quality assurance approach has implications for data capture applications used for special surveys as well as for the routine collection of health information by health workers. INTERNATIONAL REGISTERED REPORT IDENTIFIER (IRRID): DERR1-10.2196/17619
format	Online Article Text
id	pubmed-7439143
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	JMIR Publications
record_format	MEDLINE/PubMed
spelling	pubmed-7439143 2020-08-31 JMIR Res Protoc Protocol JMIR Publications 2020-08-05 /pmc/articles/PMC7439143/ /pubmed/32755886 http://dx.doi.org/10.2196/17619 Text en ©Neha Shah, Diwakar Mohan, Jean Juste Harisson Bashingwa, Osama Ummer, Arpita Chakraborty, Amnesty E. LeFevre. Originally published in JMIR Research Protocols (http://www.researchprotocols.org), 05.08.2020. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Research Protocols, is properly cited. The complete bibliographic information, a link to the original publication on http://www.researchprotocols.org, as well as this copyright and license information must be included.
title	Using Machine Learning to Optimize the Quality of Survey Data: Protocol for a Use Case in India
topic	Protocol
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7439143/ https://www.ncbi.nlm.nih.gov/pubmed/32755886 http://dx.doi.org/10.2196/17619
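
The analysis plan summarized in the abstract (features built from time stamps, "don't know" rates, and skip rates, followed by models evaluated with k-fold cross-validation on labeled data) could look roughly like the sketch below. It is an illustration only and not the authors' implementation: the record layout (`start`, `end`, `answers`), the `dont_know` answer code, the use of `None` for a skipped question, the 0/1 labels, and the nearest-centroid classifier are all assumptions chosen to keep the example self-contained. The unsupervised part of the plan is not shown.

```python
from datetime import datetime
from statistics import mean

DONT_KNOW = "dont_know"  # hypothetical code for a "don't know" answer
SKIPPED = None           # hypothetical marker for a skipped question


def make_features(interview):
    """Return [duration_minutes, dont_know_rate, skip_rate] for one interview."""
    start = datetime.fromisoformat(interview["start"])
    end = datetime.fromisoformat(interview["end"])
    answers = interview["answers"]
    n = len(answers)
    duration = (end - start).total_seconds() / 60
    dk_rate = sum(a == DONT_KNOW for a in answers) / n
    skip_rate = sum(a is SKIPPED for a in answers) / n
    return [duration, dk_rate, skip_rate]


def k_fold_indices(n_rows, k):
    """Yield (train, test) index lists; fold i tests rows i, i+k, i+2k, ..."""
    for fold in range(k):
        test = [i for i in range(n_rows) if i % k == fold]
        train = [i for i in range(n_rows) if i % k != fold]
        yield train, test


def fit_centroids(X, y):
    """Nearest-centroid model: one mean feature vector per label."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Return the label whose centroid is closest to x (squared distance)."""
    def sq_dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda lab: sq_dist(centroids[lab]))


def cross_validate(X, y, k):
    """Mean accuracy over k folds; each fold is held out exactly once."""
    scores = []
    for train, test in k_fold_indices(len(X), k):
        model = fit_centroids([X[i] for i in train], [y[i] for i in train])
        hits = sum(predict(model, X[i]) == y[i] for i in test)
        scores.append(hits / len(test))
    return mean(scores)


# Two made-up interviews; label 1 marks a record assumed to be low quality.
good = {"start": "2019-11-01T10:00:00", "end": "2019-11-01T10:45:00",
        "answers": ["yes", "no", "yes", "no"]}
bad = {"start": "2019-11-01T11:00:00", "end": "2019-11-01T11:05:00",
       "answers": ["dont_know", None, "dont_know", "yes"]}
X = [make_features(good), make_features(bad)] * 4
y = [0, 1] * 4
accuracy = cross_validate(X, y, k=4)
```

The features are left unscaled here, so interview duration in minutes dominates the distance calculation; a real pipeline would likely standardize the features and use an established library for the models and the fold splitting.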
