Optimizing data collection for public health decisions: a data mining approach

BACKGROUND: Collecting data can be cumbersome and expensive. Lack of relevant, accurate and timely data for research to inform policy may negatively impact public health. The aim of this study was to test if the careful removal of items from two community nutrition surveys guided by a data mining te...

Bibliographic Details
Main Authors:	Partington, Susan N, Papakroni, Vasil, Menzies, Tim
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2014
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4077265/ https://www.ncbi.nlm.nih.gov/pubmed/24919484 http://dx.doi.org/10.1186/1471-2458-14-593

_version_	1782323580003942400
author	Partington, Susan N Papakroni, Vasil Menzies, Tim
author_facet	Partington, Susan N Papakroni, Vasil Menzies, Tim
author_sort	Partington, Susan N
collection	PubMed
description	BACKGROUND: Collecting data can be cumbersome and expensive. Lack of relevant, accurate and timely data for research to inform policy may negatively impact public health. The aim of this study was to test if the careful removal of items from two community nutrition surveys, guided by a data mining technique called feature selection, can (a) identify a reduced dataset, while (b) not damaging the signal inside that data. METHODS: The Nutrition Environment Measures Surveys for stores (NEMS-S) and restaurants (NEMS-R) were completed on 885 retail food outlets in two counties in West Virginia between May and November of 2011. A reduced dataset was identified for each outlet type using feature selection. Coefficients from linear regression modeling were used to weight items in the reduced datasets. Weighted item values were summed with the error term to compute reduced item survey scores. Scores produced by the full survey were compared to the reduced item scores using a Wilcoxon rank-sum test. RESULTS: Feature selection identified 9 store and 16 restaurant survey items as significant predictors of the score produced from the full survey. The linear regression models built from the reduced feature sets had R² values of 92% and 94% for restaurant and grocery store data, respectively. CONCLUSIONS: While there are many potentially important variables in any domain, the most useful set may only be a small subset. The use of feature selection in the initial phase of data collection to identify the most influential variables may be a useful tool to greatly reduce the amount of data needed, thereby reducing cost.
format	Online Article Text
id	pubmed-4077265
institution	National Center for Biotechnology Information
language	English
publishDate	2014
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-40772652014-07-02 Optimizing data collection for public health decisions: a data mining approach Partington, Susan N Papakroni, Vasil Menzies, Tim BMC Public Health Research Article BACKGROUND: Collecting data can be cumbersome and expensive. Lack of relevant, accurate and timely data for research to inform policy may negatively impact public health. The aim of this study was to test if the careful removal of items from two community nutrition surveys guided by a data mining technique called feature selection, can (a) identify a reduced dataset, while (b) not damaging the signal inside that data. METHODS: The Nutrition Environment Measures Surveys for stores (NEMS-S) and restaurants (NEMS-R) were completed on 885 retail food outlets in two counties in West Virginia between May and November of 2011. A reduced dataset was identified for each outlet type using feature selection. Coefficients from linear regression modeling were used to weight items in the reduced datasets. Weighted item values were summed with the error term to compute reduced item survey scores. Scores produced by the full survey were compared to the reduced item scores using a Wilcoxon rank-sum test. RESULTS: Feature selection identified 9 store and 16 restaurant survey items as significant predictors of the score produced from the full survey. The linear regression models built from the reduced feature sets had R(2) values of 92% and 94% for restaurant and grocery store data, respectively. CONCLUSIONS: While there are many potentially important variables in any domain, the most useful set may only be a small subset. The use of feature selection in the initial phase of data collection to identify the most influential variables may be a useful tool to greatly reduce the amount of data needed thereby reducing cost. 
BioMed Central 2014-06-12 /pmc/articles/PMC4077265/ /pubmed/24919484 http://dx.doi.org/10.1186/1471-2458-14-593 Text en Copyright © 2014 Partington et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle	Research Article Partington, Susan N Papakroni, Vasil Menzies, Tim Optimizing data collection for public health decisions: a data mining approach
title	Optimizing data collection for public health decisions: a data mining approach
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4077265/ https://www.ncbi.nlm.nih.gov/pubmed/24919484 http://dx.doi.org/10.1186/1471-2458-14-593
work_keys_str_mv	AT partingtonsusann optimizingdatacollectionforpublichealthdecisionsadataminingapproach AT papakronivasil optimizingdatacollectionforpublichealthdecisionsadataminingapproach AT menziestim optimizingdatacollectionforpublichealthdecisionsadataminingapproach
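
The workflow summarized in the description field (select a subset of survey items, weight them with linear regression coefficients, then compare the reduced score with the full score) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the data are synthetic, the item weights are invented, and ranking items by absolute correlation is a simple stand-in filter, not the feature selection technique used in the article. The helper names (`pearson`, `solve`, `fit_linear`, `r_squared`) are hypothetical.

```python
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_linear(rows, y):
    """Ordinary least squares with an intercept, via the normal equations."""
    design = [[1.0] + list(r) for r in rows]
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(p)]
    return solve(xtx, xty)  # [intercept, w1, w2, ...]


def r_squared(y, y_hat):
    """Share of the variance in y explained by the predictions y_hat."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


# Synthetic survey: 200 outlets, 12 items scored 0-3 (all values invented).
rng = random.Random(0)
n_outlets, n_items, k = 200, 12, 4
item_weights = [5, 4, 3, 1, 1, 1] + [0.2] * 6  # hypothetical scoring weights
items = [[rng.randint(0, 3) for _ in range(n_items)] for _ in range(n_outlets)]
full_score = [sum(w * v for w, v in zip(item_weights, row)) for row in items]

# Stand-in feature selection: keep the k items most correlated with the full score.
ranked = sorted(
    range(n_items),
    key=lambda j: abs(pearson([row[j] for row in items], full_score)),
    reverse=True,
)
selected = sorted(ranked[:k])

# Weight the retained items with regression coefficients and rebuild the score.
reduced_rows = [[row[j] for j in selected] for row in items]
coef = fit_linear(reduced_rows, full_score)
reduced_score = [
    coef[0] + sum(c * v for c, v in zip(coef[1:], row)) for row in reduced_rows
]
r2 = r_squared(full_score, reduced_score)
print("retained items:", selected, "R^2:", round(r2, 3))
```

In a real analysis the comparison step would also include a distribution test such as the Wilcoxon rank-sum test named in the abstract (available, for example, as `scipy.stats.ranksums`); it is omitted here to keep the sketch dependency-free.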
