Optimized application of penalized regression methods to diverse genomic data

Motivation: Penalized regression methods have been adopted widely for high-dimensional feature selection and prediction in many bioinformatic and biostatistical contexts. While their theoretical properties are well-understood, specific methodology for their optimal application to genomic data has no...

Bibliographic Details
Main authors: Waldron, Levi; Pintilie, Melania; Tsao, Ming-Sound; Shepherd, Frances A.; Huttenhower, Curtis; Jurisica, Igor
Format: Online Article Text
Language: English
Published: Oxford University Press 2011
Subjects: Original Papers
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3232376/
https://www.ncbi.nlm.nih.gov/pubmed/22156367
http://dx.doi.org/10.1093/bioinformatics/btr591
author Waldron, Levi
Pintilie, Melania
Tsao, Ming-Sound
Shepherd, Frances A.
Huttenhower, Curtis
Jurisica, Igor
author_sort Waldron, Levi
collection PubMed
description Motivation: Penalized regression methods have been adopted widely for high-dimensional feature selection and prediction in many bioinformatic and biostatistical contexts. While their theoretical properties are well-understood, specific methodology for their optimal application to genomic data has not been determined. Results: Through simulation of contrasting scenarios of correlated high-dimensional survival data, we compared the LASSO, Ridge and Elastic Net penalties for prediction and variable selection. We found that a 2D tuning of the Elastic Net penalties was necessary to avoid mimicking the performance of LASSO or Ridge regression. Furthermore, we found that in a simulated scenario favoring the LASSO penalty, a univariate pre-filter made the Elastic Net behave more like Ridge regression, which was detrimental to prediction performance. We demonstrate the real-life application of these methods to predicting the survival of cancer patients from microarray data, and to classification of obese and lean individuals from metagenomic data. Based on these results, we provide an optimized set of guidelines for the application of penalized regression for reproducible class comparison and prediction with genomic data. Availability and Implementation: A parallelized implementation of the methods presented for regression and for simulation of synthetic data is provided as the pensim R package, available at http://cran.r-project.org/web/packages/pensim/index.html. Contact: chuttenh@hsph.harvard.edu; juris@ai.utoronto.ca Supplementary Information: Supplementary data are available at Bioinformatics online.
format Online
Article
Text
id pubmed-3232376
institution National Center for Biotechnology Information
language English
publishDate 2011
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-3232376 2011-12-07 Optimized application of penalized regression methods to diverse genomic data Waldron, Levi Pintilie, Melania Tsao, Ming-Sound Shepherd, Frances A. Huttenhower, Curtis Jurisica, Igor Bioinformatics Original Papers
Oxford University Press 2011-12-15 2011-10-24 /pmc/articles/PMC3232376/ /pubmed/22156367 http://dx.doi.org/10.1093/bioinformatics/btr591 Text en © The Author(s) 2011. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/3.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Optimized application of penalized regression methods to diverse genomic data
topic Original Papers