Improving prevalence estimation through data fusion: methods and validation

Bibliographic Details
Main Authors:	Aluja-Banet, Tomàs; Daunis-i-Estadella, Josep; Brunsó, Núria; Mompart-Penina, Anna
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2015
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4478714/ https://www.ncbi.nlm.nih.gov/pubmed/26104747 http://dx.doi.org/10.1186/s12911-015-0169-z

author	Aluja-Banet, Tomàs; Daunis-i-Estadella, Josep; Brunsó, Núria; Mompart-Penina, Anna
collection	PubMed
description	BACKGROUND: Estimation of health prevalences is usually performed with a single survey. Some attempts have been made to integrate more than one source of data. We propose here to validate this approach through data fusion. Data fusion is the process of integrating two sources of data into one combined file. It allows us to take even greater advantage of existing information collected in databases. Here, we use data fusion to improve the estimation of health prevalences for two primary health factors: cardiovascular diseases and diabetes. METHODS: We use a real data fusion operation on population health, where the imputation of basic health risk factors is used to enrich a large-scale survey on self-reported health status. We propose choosing the imputation methodology for this problem through a suite of validation statistics that assess the quality of the fused data. The compared imputation techniques have been chosen from among the main imputation methodologies: k-nearest neighbor, probabilistic modeling and regression. We use the 2006 Health Survey of Catalonia, which provides a complete report of the perceived health status. In order to deal with the uncertainty problem, we compare these methodologies under the single and multiple imputation frames. RESULTS: A suite of validation statistics allows us to discern the strengths and weaknesses of the studied imputation methods. Multiple imputation outperforms single imputation by providing better and much more stable estimates, according to the computed validation statistics. The summarized results indicate that the probabilistic methods preserve the multivariate structure better; sequential regression methods deliver greater accuracy of imputed data; and nearest neighbor methods end up with a more realistic distribution of imputed data. CONCLUSIONS: Data fusion allows us to integrate two sources of information in order to take greater advantage of the available data. Multiply imputed sequential regression models have the advantage of greater interpretability and can be used for health policy. Under certain conditions, more accurate estimates of the prevalences can be obtained using fused data (the original data plus the imputed data) than by using only the observed data.
format	Online Article Text
id	pubmed-4478714
institution	National Center for Biotechnology Information
language	English
publishDate	2015
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4478714 2015-06-25 Improving prevalence estimation through data fusion: methods and validation Aluja-Banet, Tomàs; Daunis-i-Estadella, Josep; Brunsó, Núria; Mompart-Penina, Anna BMC Med Inform Decis Mak Research Article BioMed Central 2015-06-24 /pmc/articles/PMC4478714/ /pubmed/26104747 http://dx.doi.org/10.1186/s12911-015-0169-z Text en © Aluja-Banet et al. 2015 This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	Improving prevalence estimation through data fusion: methods and validation
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4478714/ https://www.ncbi.nlm.nih.gov/pubmed/26104747 http://dx.doi.org/10.1186/s12911-015-0169-z

Similar Items