
Detection of independent associations in a large epidemiologic dataset: a comparison of random forests, boosted regression trees, conventional and penalized logistic regression for identifying independent factors associated with H1N1pdm influenza infections

BACKGROUND: Big data is steadily growing in epidemiology. We explored the performances of methods dedicated to big data analysis for detecting independent associations between exposures and a health outcome. METHODS: We searched for associations between 303 covariates and influenza infection in 498...


Bibliographic Details
Main authors: Mansiaux, Yohann, Carrat, Fabrice
Format: Online Article Text
Language: English
Published: BioMed Central 2014
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4146451/
https://www.ncbi.nlm.nih.gov/pubmed/25154404
http://dx.doi.org/10.1186/1471-2288-14-99
author Mansiaux, Yohann
Carrat, Fabrice
collection PubMed
description BACKGROUND: Big data is steadily growing in epidemiology. We explored the performances of methods dedicated to big data analysis for detecting independent associations between exposures and a health outcome. METHODS: We searched for associations between 303 covariates and influenza infection in 498 subjects (14% infected) sampled from a dedicated cohort. Independent associations were detected using two data mining methods, the Random Forests (RF) and the Boosted Regression Trees (BRT); the conventional logistic regression framework (Univariate Followed by Multivariate Logistic Regression - UFMLR) and the Least Absolute Shrinkage and Selection Operator (LASSO) with penalty in multivariate logistic regression to achieve a sparse selection of covariates. We developed permutations tests to assess the statistical significance of associations. We simulated 500 similar sized datasets to estimate the True (TPR) and False (FPR) Positive Rates associated with these methods. RESULTS: Between 3 and 24 covariates (1%-8%) were identified as associated with influenza infection depending on the method. The pre-seasonal haemagglutination inhibition antibody titer was the unique covariate selected with all methods while 266 (87%) covariates were not selected by any method. At 5% nominal significance level, the TPR were 85% with RF, 80% with BRT, 26% to 49% with UFMLR, 71% to 78% with LASSO. Conversely, the FPR were 4% with RF and BRT, 9% to 2% with UFMLR, and 9% to 4% with LASSO. CONCLUSIONS: Data mining methods and LASSO should be considered as valuable methods to detect independent associations in large epidemiologic datasets.
format Online
Article
Text
id pubmed-4146451
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4146451 2014-09-02 Mansiaux, Yohann Carrat, Fabrice BMC Med Res Methodol Research Article BioMed Central 2014-08-26 /pmc/articles/PMC4146451/ /pubmed/25154404 http://dx.doi.org/10.1186/1471-2288-14-99 Text en Copyright © 2014 Mansiaux and Carrat; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Detection of independent associations in a large epidemiologic dataset: a comparison of random forests, boosted regression trees, conventional and penalized logistic regression for identifying independent factors associated with H1N1pdm influenza infections
topic Research Article