Wrapper-based selection of genetic features in genome-wide association studies through fast matrix operations

BACKGROUND: Through the wealth of information contained within them, genome-wide association studies (GWAS) have the potential to provide researchers with a systematic means of associating genetic variants with a wide variety of disease phenotypes. Due to the limitations of approaches that have anal...

Bibliographic Details
Main authors: Pahikkala, Tapio; Okser, Sebastian; Airola, Antti; Salakoski, Tapio; Aittokallio, Tero
Format: Online Article Text
Language: English
Published: BioMed Central 2012
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3606421/
https://www.ncbi.nlm.nih.gov/pubmed/22551170
http://dx.doi.org/10.1186/1748-7188-7-11
author Pahikkala, Tapio
Okser, Sebastian
Airola, Antti
Salakoski, Tapio
Aittokallio, Tero
author_sort Pahikkala, Tapio
collection PubMed
description BACKGROUND: Through the wealth of information contained within them, genome-wide association studies (GWAS) have the potential to provide researchers with a systematic means of associating genetic variants with a wide variety of disease phenotypes. Due to the limitations of approaches that have analyzed single variants one at a time, it has been proposed that the genetic basis of these disorders could be determined through detailed analysis of the genetic variants themselves and in conjunction with one another. The construction of models that account for these subsets of variants requires methodologies that generate predictions based on the total risk of a particular group of polymorphisms. However, due to the excessive number of variants, constructing these types of models has so far been computationally infeasible. RESULTS: We have implemented an algorithm, known as greedy RLS, that we use to perform the first known wrapper-based feature selection on the genome-wide level. The running time of greedy RLS grows linearly in the number of training examples, the number of features in the original data set, and the number of selected features. This speed is achieved through computational short-cuts based on matrix calculus. Since the memory consumption in present-day computers can form an even tighter bottleneck than running time, we also developed a space efficient variation of greedy RLS which trades running time for memory. These approaches are then compared to traditional wrapper-based feature selection implementations based on support vector machines (SVM) to reveal the relative speed-up and to assess the feasibility of the new algorithm. As a proof of concept, we apply greedy RLS to the Hypertension – UK National Blood Service WTCCC dataset and select the most predictive variants using 3-fold external cross-validation in less than 26 minutes on a high-end desktop. 
On this dataset, we also show that greedy RLS has a better classification performance on independent test data than a classifier trained using features selected by a statistical p-value-based filter, which is currently the most popular approach for constructing predictive models in GWAS. CONCLUSIONS: Greedy RLS is the first known implementation of a machine learning based method with the capability to conduct a wrapper-based feature selection on an entire GWAS containing several thousand examples and over 400,000 variants. In our experiments, greedy RLS selected a highly predictive subset of genetic variants in a fraction of the time spent by wrapper-based selection methods used together with SVM classifiers. The proposed algorithms are freely available as part of the RLScore software library at http://users.utu.fi/aatapa/RLScore/.
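As a rough illustration of the wrapper-based selection described in the abstract above, the following Python sketch runs a naive greedy forward selection with regularized least squares (RLS). It is not the RLScore implementation and does not reproduce the linear-time matrix shortcuts of greedy RLS: it refits the model for every candidate feature. The leave-one-out selection criterion, the function names, the regularization parameter `lam`, and the use of NumPy are assumptions made for this example.

```python
import numpy as np


def loo_squared_error(X_sub, y, lam):
    """Sum of squared leave-one-out errors of RLS (ridge regression, no bias term).

    Uses the closed-form identity: with G = (K + lam*I)^-1 and a = G @ y,
    the leave-one-out residual of example i equals a[i] / G[i, i].
    """
    n = X_sub.shape[0]
    K = X_sub @ X_sub.T                      # linear kernel on the chosen features
    G = np.linalg.inv(K + lam * np.eye(n))   # regularized inverse
    a = G @ y                                # dual RLS coefficients
    loo_residuals = a / np.diag(G)
    return float(np.sum(loo_residuals ** 2))


def greedy_forward_selection(X, y, k, lam=1.0):
    """Hypothetical helper: pick k feature indices, one per round.

    Each round adds the feature whose inclusion gives the lowest
    leave-one-out error. This is the naive wrapper loop, not the
    optimized greedy RLS algorithm from the paper.
    """
    selected = []
    remaining = list(range(X.shape[1]))
    for _ in range(k):
        best_err, best_j = None, None
        for j in remaining:
            err = loo_squared_error(X[:, selected + [j]], y, lam)
            if best_err is None or err < best_err:
                best_err, best_j = err, j
        selected.append(best_j)
        remaining.remove(best_j)
    return selected
```

Calling `greedy_forward_selection(X, y, k=10)` on an examples-by-features matrix `X` returns the indices in the order they were chosen. The repeated refitting makes this sketch far too slow for a genome-wide data set, which is the bottleneck the abstract says greedy RLS avoids.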
format Online
Article
Text
id pubmed-3606421
institution National Center for Biotechnology Information
language English
publishDate 2012
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3606421 2013-03-27 Wrapper-based selection of genetic features in genome-wide association studies through fast matrix operations Pahikkala, Tapio Okser, Sebastian Airola, Antti Salakoski, Tapio Aittokallio, Tero Algorithms Mol Biol Research BioMed Central 2012-05-02 /pmc/articles/PMC3606421/ /pubmed/22551170 http://dx.doi.org/10.1186/1748-7188-7-11 Text en Copyright ©2012 Pahikkala et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Wrapper-based selection of genetic features in genome-wide association studies through fast matrix operations
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3606421/
https://www.ncbi.nlm.nih.gov/pubmed/22551170
http://dx.doi.org/10.1186/1748-7188-7-11