A machine learning pipeline for quantitative phenotype prediction from genotype data

BACKGROUND: Quantitative phenotypes emerge everywhere in systems biology and biomedicine, owing either to direct interest in quantitative traits or to high individual variability that makes it hard or impossible to classify samples into distinct categories, as is often the case with complex common diseases. Machin...

Bibliographic Details
Main Authors:	Guzzetta, Giorgio; Jurman, Giuseppe; Furlanello, Cesare
Format:	Text
Language:	English
Published:	BioMed Central 2010
Subjects:	Research
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2966290/ https://www.ncbi.nlm.nih.gov/pubmed/21034428 http://dx.doi.org/10.1186/1471-2105-11-S8-S3

_version_	1782189566609850368
author	Guzzetta, Giorgio; Jurman, Giuseppe; Furlanello, Cesare
author_facet	Guzzetta, Giorgio; Jurman, Giuseppe; Furlanello, Cesare
author_sort	Guzzetta, Giorgio
collection	PubMed
description	BACKGROUND: Quantitative phenotypes emerge everywhere in systems biology and biomedicine, owing either to direct interest in quantitative traits or to high individual variability that makes it hard or impossible to classify samples into distinct categories, as is often the case with complex common diseases. Machine learning approaches to genotype-phenotype mapping may significantly improve Genome-Wide Association Studies (GWAS) results by explicitly focusing on predictivity and optimal feature selection in a multivariate setting. It is, however, essential that stringent and well-documented Data Analysis Protocols (DAP) are used to control sources of variability and ensure reproducibility of results. We present a genome-to-phenotype pipeline of machine learning modules for quantitative phenotype prediction. The pipeline can be applied for the direct use of whole-genome information in functional studies. As a realistic example, the problem of fitting complex phenotypic traits in heterogeneous stock mice from single nucleotide polymorphisms (SNPs) is considered here. METHODS: The core element of the pipeline is the L1L2 regularization method based on the naïve elastic net. The method provides both a regression model and a dimensionality reduction procedure suitable for correlated features. Model and SNP markers are selected through a DAP originally developed in the MAQC-II collaborative initiative of the U.S. FDA for the identification of clinical biomarkers from microarray data. The L1L2 approach is compared with standard Support Vector Regression (SVR) and with Recursive Jump Monte Carlo Markov Chain (MCMC). Algebraic indicators of stability of partial lists are used for model selection; the final panel of markers is obtained by a procedure at the chromosome scale, termed ‘saturation’, to recover SNPs in Linkage Disequilibrium with those selected. RESULTS: The L1L2 pipeline achieves accuracies comparable to those of both MCMC and SVR. Good agreement is also found between SNPs selected by the L1L2 algorithms and candidate loci previously identified by a standard GWAS. Combining L1L2-based feature selection with a saturation procedure addresses the tendency of many feature selection algorithms to neglect highly correlated features. CONCLUSIONS: The L1L2 pipeline has proven effective in terms of marker selection and prediction accuracy. This study indicates that machine learning techniques may support quantitative phenotype prediction, provided that adequate DAPs are employed to control bias in model selection.
format	Text
id	pubmed-2966290
institution	National Center for Biotechnology Information
language	English
publishDate	2010
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-2966290 2010-10-30 A machine learning pipeline for quantitative phenotype prediction from genotype data Guzzetta, Giorgio Jurman, Giuseppe Furlanello, Cesare BMC Bioinformatics Research BioMed Central 2010-10-26 /pmc/articles/PMC2966290/ /pubmed/21034428 http://dx.doi.org/10.1186/1471-2105-11-S8-S3 Text en Copyright ©2010 Furlanello et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Research Guzzetta, Giorgio Jurman, Giuseppe Furlanello, Cesare A machine learning pipeline for quantitative phenotype prediction from genotype data
title	A machine learning pipeline for quantitative phenotype prediction from genotype data
title_full	A machine learning pipeline for quantitative phenotype prediction from genotype data
title_fullStr	A machine learning pipeline for quantitative phenotype prediction from genotype data
title_full_unstemmed	A machine learning pipeline for quantitative phenotype prediction from genotype data
title_short	A machine learning pipeline for quantitative phenotype prediction from genotype data
title_sort	machine learning pipeline for quantitative phenotype prediction from genotype data
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2966290/ https://www.ncbi.nlm.nih.gov/pubmed/21034428 http://dx.doi.org/10.1186/1471-2105-11-S8-S3
work_keys_str_mv	AT guzzettagiorgio amachinelearningpipelineforquantitativephenotypepredictionfromgenotypedata AT jurmangiuseppe amachinelearningpipelineforquantitativephenotypepredictionfromgenotypedata AT furlanellocesare amachinelearningpipelineforquantitativephenotypepredictionfromgenotypedata AT guzzettagiorgio machinelearningpipelineforquantitativephenotypepredictionfromgenotypedata AT jurmangiuseppe machinelearningpipelineforquantitativephenotypepredictionfromgenotypedata AT furlanellocesare machinelearningpipelineforquantitativephenotypepredictionfromgenotypedata
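
The abstract names two steps that a short example may make more concrete: L1L2 (naïve elastic net) regression for marker selection, and a 'saturation' step that recovers SNPs in linkage disequilibrium with the selected ones. The following is a minimal sketch under stated assumptions, not the authors' implementation. The function names (naive_elastic_net, saturate), the parameters (tau, mu, r2_min), the proximal-gradient solver, the squared-correlation threshold used as a stand-in for linkage disequilibrium, and the toy genotype data are all hypothetical choices made for illustration.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (element-wise shrinkage)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def naive_elastic_net(X, y, tau, mu, n_iter=500):
    """Minimize (1/2n)||y - X b||^2 + (mu/2)||b||^2 + tau ||b||_1
    by proximal gradient descent (ISTA). Assumed scaling, for illustration."""
    n, p = X.shape
    # Lipschitz constant of the gradient of the smooth (L2 + squared loss) part.
    lipschitz = np.linalg.norm(X, 2) ** 2 / n + mu
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = -X.T @ (y - X @ beta) / n + mu * beta
        beta = soft_threshold(beta - grad / lipschitz, tau / lipschitz)
    return beta

def saturate(X, selected, r2_min=0.8):
    """Return indices of all markers whose squared correlation with at
    least one selected marker is >= r2_min (a simple proxy for LD)."""
    r2 = np.corrcoef(X, rowvar=False) ** 2
    in_ld = (r2[:, selected] >= r2_min).any(axis=1)
    return np.flatnonzero(in_ld)

# Toy data (hypothetical): 200 samples, 20 SNPs coded 0/1/2.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(200, 20)).astype(float)
X[:, 1] = X[:, 0]                      # SNP 1 is in perfect LD with SNP 0
X -= X.mean(axis=0)                    # center genotypes
y = 2.0 * X[:, 0] - 1.5 * X[:, 5] + 0.5 * rng.standard_normal(200)
y -= y.mean()                          # center phenotype

beta = naive_elastic_net(X, y, tau=0.1, mu=0.1)
selected = np.flatnonzero(beta != 0)   # markers kept by the L1 penalty
saturated = saturate(X, selected)      # add markers correlated with them

print("selected markers: ", selected.tolist())
print("saturated markers:", saturated.tolist())
```

In this sketch the L2 term keeps the problem strictly convex, so correlated markers tend to share weight, while the saturation pass guarantees that any marker highly correlated with a selected one ends up in the final panel even if the L1 penalty dropped it. In practice tau and mu would be chosen by a resampling protocol such as the DAP the abstract describes, which is not reproduced here.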
