
Wrapper-based selection of genetic features in genome-wide association studies through fast matrix operations

BACKGROUND: Through the wealth of information contained within them, genome-wide association studies (GWAS) have the potential to provide researchers with a systematic means of associating genetic variants with a wide variety of disease phenotypes. Due to the limitations of approaches that have anal...


Bibliographic Details
Main Authors:	Pahikkala, Tapio, Okser, Sebastian, Airola, Antti, Salakoski, Tapio, Aittokallio, Tero
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2012
Subjects:	Research
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3606421/ https://www.ncbi.nlm.nih.gov/pubmed/22551170 http://dx.doi.org/10.1186/1748-7188-7-11

_version_	1782264008221392896
author	Pahikkala, Tapio Okser, Sebastian Airola, Antti Salakoski, Tapio Aittokallio, Tero
author_facet	Pahikkala, Tapio Okser, Sebastian Airola, Antti Salakoski, Tapio Aittokallio, Tero
author_sort	Pahikkala, Tapio
collection	PubMed
description	BACKGROUND: Through the wealth of information contained within them, genome-wide association studies (GWAS) have the potential to provide researchers with a systematic means of associating genetic variants with a wide variety of disease phenotypes. Due to the limitations of approaches that have analyzed single variants one at a time, it has been proposed that the genetic basis of these disorders could be determined through detailed analysis of the genetic variants themselves and in conjunction with one another. The construction of models that account for these subsets of variants requires methodologies that generate predictions based on the total risk of a particular group of polymorphisms. However, due to the excessive number of variants, constructing these types of models has so far been computationally infeasible. RESULTS: We have implemented an algorithm, known as greedy RLS, that we use to perform the first known wrapper-based feature selection on the genome-wide level. The running time of greedy RLS grows linearly in the number of training examples, the number of features in the original data set, and the number of selected features. This speed is achieved through computational short-cuts based on matrix calculus. Since the memory consumption in present-day computers can form an even tighter bottleneck than running time, we also developed a space efficient variation of greedy RLS which trades running time for memory. These approaches are then compared to traditional wrapper-based feature selection implementations based on support vector machines (SVM) to reveal the relative speed-up and to assess the feasibility of the new algorithm. As a proof of concept, we apply greedy RLS to the Hypertension – UK National Blood Service WTCCC dataset and select the most predictive variants using 3-fold external cross-validation in less than 26 minutes on a high-end desktop. 
On this dataset, we also show that greedy RLS has a better classification performance on independent test data than a classifier trained using features selected by a statistical p-value-based filter, which is currently the most popular approach for constructing predictive models in GWAS. CONCLUSIONS: Greedy RLS is the first known implementation of a machine learning based method with the capability to conduct a wrapper-based feature selection on an entire GWAS containing several thousand examples and over 400,000 variants. In our experiments, greedy RLS selected a highly predictive subset of genetic variants in a fraction of the time spent by wrapper-based selection methods used together with SVM classifiers. The proposed algorithms are freely available as part of the RLScore software library at http://users.utu.fi/aatapa/RLScore/.
format	Online Article Text
id	pubmed-3606421
institution	National Center for Biotechnology Information
language	English
publishDate	2012
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-36064212013-03-27 Wrapper-based selection of genetic features in genome-wide association studies through fast matrix operations Pahikkala, Tapio Okser, Sebastian Airola, Antti Salakoski, Tapio Aittokallio, Tero Algorithms Mol Biol Research BACKGROUND: Through the wealth of information contained within them, genome-wide association studies (GWAS) have the potential to provide researchers with a systematic means of associating genetic variants with a wide variety of disease phenotypes. Due to the limitations of approaches that have analyzed single variants one at a time, it has been proposed that the genetic basis of these disorders could be determined through detailed analysis of the genetic variants themselves and in conjunction with one another. The construction of models that account for these subsets of variants requires methodologies that generate predictions based on the total risk of a particular group of polymorphisms. However, due to the excessive number of variants, constructing these types of models has so far been computationally infeasible. RESULTS: We have implemented an algorithm, known as greedy RLS, that we use to perform the first known wrapper-based feature selection on the genome-wide level. The running time of greedy RLS grows linearly in the number of training examples, the number of features in the original data set, and the number of selected features. This speed is achieved through computational short-cuts based on matrix calculus. Since the memory consumption in present-day computers can form an even tighter bottleneck than running time, we also developed a space efficient variation of greedy RLS which trades running time for memory. These approaches are then compared to traditional wrapper-based feature selection implementations based on support vector machines (SVM) to reveal the relative speed-up and to assess the feasibility of the new algorithm. 
As a proof of concept, we apply greedy RLS to the Hypertension – UK National Blood Service WTCCC dataset and select the most predictive variants using 3-fold external cross-validation in less than 26 minutes on a high-end desktop. On this dataset, we also show that greedy RLS has a better classification performance on independent test data than a classifier trained using features selected by a statistical p-value-based filter, which is currently the most popular approach for constructing predictive models in GWAS. CONCLUSIONS: Greedy RLS is the first known implementation of a machine learning based method with the capability to conduct a wrapper-based feature selection on an entire GWAS containing several thousand examples and over 400,000 variants. In our experiments, greedy RLS selected a highly predictive subset of genetic variants in a fraction of the time spent by wrapper-based selection methods used together with SVM classifiers. The proposed algorithms are freely available as part of the RLScore software library at http://users.utu.fi/aatapa/RLScore/. BioMed Central 2012-05-02 /pmc/articles/PMC3606421/ /pubmed/22551170 http://dx.doi.org/10.1186/1748-7188-7-11 Text en Copyright ©2012 Pahikkala et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Research Pahikkala, Tapio Okser, Sebastian Airola, Antti Salakoski, Tapio Aittokallio, Tero Wrapper-based selection of genetic features in genome-wide association studies through fast matrix operations
title	Wrapper-based selection of genetic features in genome-wide association studies through fast matrix operations
title_sort	wrapper-based selection of genetic features in genome-wide association studies through fast matrix operations
topic	Research
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3606421/ https://www.ncbi.nlm.nih.gov/pubmed/22551170 http://dx.doi.org/10.1186/1748-7188-7-11
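The description field in this record outlines greedy RLS as a wrapper-based forward selection built on regularized least-squares. The following is a minimal illustrative sketch of that general idea, not the authors' implementation and not the RLScore API. It assumes a leave-one-out squared error as the selection criterion (the abstract does not name the criterion), and it retrains from scratch for every candidate feature, so it does not reproduce the linear running time reported in the abstract. The function names `loo_squared_error` and `greedy_forward_selection` and the parameter `reg` are hypothetical.

```python
import numpy as np


def loo_squared_error(X, y, reg):
    """Leave-one-out squared error of RLS (ridge regression) on X, y.

    Uses the standard closed-form shortcut, so no model is retrained
    per held-out example.
    """
    n = X.shape[0]
    # Dual form: G = (X X^T + reg * I)^-1, dual coefficients a = G y.
    G = np.linalg.inv(X @ X.T + reg * np.eye(n))
    a = G @ y
    # The leave-one-out residual of example i equals a_i / G_ii.
    loo_residuals = a / np.diag(G)
    return float(np.sum(loo_residuals ** 2))


def greedy_forward_selection(X, y, k, reg=1.0):
    """Naive greedy forward selection of k feature indices (assumed criterion: LOO error)."""
    selected = []
    remaining = list(range(X.shape[1]))
    for _ in range(k):
        best_feature, best_error = None, np.inf
        # Wrapper step: evaluate the learner once per candidate feature.
        for j in remaining:
            err = loo_squared_error(X[:, selected + [j]], y, reg)
            if err < best_error:
                best_feature, best_error = j, err
        selected.append(best_feature)
        remaining.remove(best_feature)
    return selected
```

Each round here costs one matrix inversion per candidate feature, which is exactly the expense that the matrix-calculus short-cuts mentioned in the abstract are meant to avoid.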

