
Evaluation of multiple variate selection methods from a biological perspective: a nutrigenomics case study

Genomics-based technologies produce large amounts of data. To interpret the results and identify the most important variates related to phenotypes of interest, various multivariate regression and variate selection methods are used. Although inspected for statistical performance, the relevance of mul...

Bibliographic Details
Main authors: Tapp, Henri S., Radonjic, Marijana, Kate Kemsley, E., Thissen, Uwe
Format: Online Article Text
Language: English
Published: Springer-Verlag 2012
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3380194/
https://www.ncbi.nlm.nih.gov/pubmed/22382778
http://dx.doi.org/10.1007/s12263-012-0288-4
_version_ 1782236300334596096
author Tapp, Henri S.
Radonjic, Marijana
Kate Kemsley, E.
Thissen, Uwe
collection PubMed
description Genomics-based technologies produce large amounts of data. To interpret the results and identify the most important variates related to phenotypes of interest, various multivariate regression and variate selection methods are used. Although inspected for statistical performance, the relevance of multivariate models in interpreting biological data sets often remains elusive. We compare various multivariate regression and variate selection methods applied to a nutrigenomics data set in terms of performance, utility and biological interpretability. The studied data set comprised hepatic transcriptome (10,072 predictor variates) and plasma protein concentrations [2 dependent variates: Leptin (LEP) and Tissue inhibitor of metalloproteinase 1 (TIMP-1)] collected during a high-fat diet study in ApoE3Leiden mice. The multivariate regression methods used were: partial least squares “PLS”; a genetic algorithm-based multiple linear regression, “GA-MLR”; two least-angle shrinkage methods, “LASSO” and “ELASTIC NET”; and a variant of PLS that uses covariance-based variate selection, “CovProc.” Two methods of ranking the genes for Gene Set Enrichment Analysis (GSEA) were also investigated: either by their correlation with the protein data or by the stability of the PLS regression coefficients. The regression methods performed similarly, with CovProc and GA performing the best and worst, respectively (R-squared values based on “double cross-validation” predictions of 0.762 and 0.451 for LEP; and 0.701 and 0.482 for TIMP-1). CovProc, LASSO and ELASTIC NET all produced parsimonious regression models and consistently identified small subsets of variates, with high commonality between the methods. Comparison of the gene ranking approaches found a high degree of agreement, with PLS-based ranking finding fewer significant gene sets. 
We recommend the use of CovProc for variate selection, in tandem with univariate methods, and the use of correlation-based ranking for GSEA-like pathway analysis methods. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1007/s12263-012-0288-4) contains supplementary material, which is available to authorized users.
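The correlation-based gene ranking recommended in the description can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `pearson` and `rank_by_correlation`, the gene labels, and the toy values are all hypothetical, and the sketch assumes that "ranking by correlation" means ordering genes by their Pearson correlation with one protein measured across the same samples.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # A constant variate carries no ranking information.
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def rank_by_correlation(expression, response):
    """Order genes from most positively to most negatively correlated.

    expression: dict mapping a gene label to its values across samples
    response:   protein concentrations for the same samples
    """
    scores = {gene: pearson(values, response)
              for gene, values in expression.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: three genes measured in four samples.
expression = {
    "gene_a": [1.0, 2.0, 3.0, 4.0],
    "gene_b": [4.0, 3.0, 2.0, 1.0],
    "gene_c": [1.0, 3.0, 2.0, 4.0],
}
protein = [10.0, 20.0, 30.0, 40.0]

ranked = rank_by_correlation(expression, protein)
for gene, r in ranked:
    print(f"{gene}\t{r:.3f}")
```

A ranked list of this kind is the usual input to a GSEA-style tool, which then tests whether members of a gene set cluster toward either end of the ranking.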
format Online
Article
Text
id pubmed-3380194
institution National Center for Biotechnology Information
language English
publishDate 2012
publisher Springer-Verlag
record_format MEDLINE/PubMed
spelling pubmed-3380194 2012-07-06 Evaluation of multiple variate selection methods from a biological perspective: a nutrigenomics case study Tapp, Henri S. Radonjic, Marijana Kate Kemsley, E. Thissen, Uwe Genes Nutr Research Paper Springer-Verlag 2012-03-02 /pmc/articles/PMC3380194/ /pubmed/22382778 http://dx.doi.org/10.1007/s12263-012-0288-4 Text en © The Author(s) 2012 https://creativecommons.org/licenses/by/4.0/ This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.
title Evaluation of multiple variate selection methods from a biological perspective: a nutrigenomics case study
topic Research Paper
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3380194/
https://www.ncbi.nlm.nih.gov/pubmed/22382778
http://dx.doi.org/10.1007/s12263-012-0288-4