
Performance of a blockwise approach in variable selection using linkage disequilibrium information

BACKGROUND: Genome-wide association studies (GWAS) aim at finding genetic markers that are significantly associated with a phenotype of interest. Single nucleotide polymorphism (SNP) data from the entire genome are collected for many thousands of SNP markers, leading to high-dimensional regression p...


Bibliographic Details
Main Authors:	Dehman, Alia; Ambroise, Christophe; Neuvial, Pierre
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2015
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4430909/ https://www.ncbi.nlm.nih.gov/pubmed/25951947 http://dx.doi.org/10.1186/s12859-015-0556-6

_version_	1782371250601984000
author	Dehman, Alia; Ambroise, Christophe; Neuvial, Pierre
author_facet	Dehman, Alia; Ambroise, Christophe; Neuvial, Pierre
author_sort	Dehman, Alia
collection	PubMed
description	BACKGROUND: Genome-wide association studies (GWAS) aim at finding genetic markers that are significantly associated with a phenotype of interest. Single nucleotide polymorphism (SNP) data from the entire genome are collected for many thousands of SNP markers, leading to high-dimensional regression problems where the number of predictors greatly exceeds the number of observations. Moreover, these predictors are statistically dependent, in particular due to linkage disequilibrium (LD). We propose a three-step approach that explicitly takes advantage of the grouping structure induced by LD in order to identify common variants which may have been missed by single marker analyses (SMA). In the first step, we perform a hierarchical clustering of SNPs with an adjacency constraint using LD as a similarity measure. In the second step, we apply a model selection approach to the obtained hierarchy in order to define LD blocks. Finally, we perform Group Lasso regression on the inferred LD blocks. We investigate the efficiency of this approach compared to state-of-the-art regression methods: haplotype association tests, SMA, and Lasso and Elastic-Net regressions. RESULTS: Our results on simulated data show that the proposed method performs better than state-of-the-art approaches as soon as the number of causal SNPs within an LD block exceeds 2. Our results on semi-simulated data and a previously published HIV data set illustrate the relevance of the proposed method and its robustness to a real LD structure. The method is implemented in the R package BALD (Blockwise Approach using Linkage Disequilibrium), available from http://www.math-evry.cnrs.fr/publications/logiciels. CONCLUSIONS: Our results show that the proposed method is efficient not only at the level of LD blocks, by accurately inferring the underlying block structure, but also at the level of individual SNPs. Thus, this study demonstrates the importance of tailored integration of biological knowledge in high-dimensional genomic studies such as GWAS. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-015-0556-6) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-4430909
institution	National Center for Biotechnology Information
language	English
publishDate	2015
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4430909 2015-05-15 Performance of a blockwise approach in variable selection using linkage disequilibrium information Dehman, Alia; Ambroise, Christophe; Neuvial, Pierre BMC Bioinformatics Research Article BACKGROUND: Genome-wide association studies (GWAS) aim at finding genetic markers that are significantly associated with a phenotype of interest. Single nucleotide polymorphism (SNP) data from the entire genome are collected for many thousands of SNP markers, leading to high-dimensional regression problems where the number of predictors greatly exceeds the number of observations. Moreover, these predictors are statistically dependent, in particular due to linkage disequilibrium (LD). We propose a three-step approach that explicitly takes advantage of the grouping structure induced by LD in order to identify common variants which may have been missed by single marker analyses (SMA). In the first step, we perform a hierarchical clustering of SNPs with an adjacency constraint using LD as a similarity measure. In the second step, we apply a model selection approach to the obtained hierarchy in order to define LD blocks. Finally, we perform Group Lasso regression on the inferred LD blocks. We investigate the efficiency of this approach compared to state-of-the-art regression methods: haplotype association tests, SMA, and Lasso and Elastic-Net regressions. RESULTS: Our results on simulated data show that the proposed method performs better than state-of-the-art approaches as soon as the number of causal SNPs within an LD block exceeds 2. Our results on semi-simulated data and a previously published HIV data set illustrate the relevance of the proposed method and its robustness to a real LD structure. The method is implemented in the R package BALD (Blockwise Approach using Linkage Disequilibrium), available from http://www.math-evry.cnrs.fr/publications/logiciels. CONCLUSIONS: Our results show that the proposed method is efficient not only at the level of LD blocks, by accurately inferring the underlying block structure, but also at the level of individual SNPs. Thus, this study demonstrates the importance of tailored integration of biological knowledge in high-dimensional genomic studies such as GWAS. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-015-0556-6) contains supplementary material, which is available to authorized users. BioMed Central 2015-05-08 /pmc/articles/PMC4430909/ /pubmed/25951947 http://dx.doi.org/10.1186/s12859-015-0556-6 Text en © Dehman et al.; licensee BioMed Central. 2015 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle	Research Article Dehman, Alia Ambroise, Christophe Neuvial, Pierre Performance of a blockwise approach in variable selection using linkage disequilibrium information
title	Performance of a blockwise approach in variable selection using linkage disequilibrium information
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4430909/ https://www.ncbi.nlm.nih.gov/pubmed/25951947 http://dx.doi.org/10.1186/s12859-015-0556-6
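
The three-step approach summarized in the description field can be illustrated with a small sketch. This is not the BALD package and not the authors' implementation. It is a minimal Python illustration under stated assumptions: LD is approximated by the squared correlation between SNP columns, the adjacency-constrained clustering uses a simple average-similarity merge rule, the model selection step is replaced by a fixed, user-supplied number of blocks, and the Group Lasso is solved by plain proximal gradient descent. All function names (`ld_r2`, `adjacent_blocks`, `group_lasso`) and the toy data are hypothetical.

```python
import numpy as np

def ld_r2(X):
    """Squared correlation between SNP columns (assumed LD measure)."""
    return np.corrcoef(X, rowvar=False) ** 2

def adjacent_blocks(r2, n_blocks):
    """Greedy adjacency-constrained agglomeration: only neighbouring
    blocks may merge; the pair with the highest mean r2 merges first.
    n_blocks is fixed here instead of chosen by model selection."""
    blocks = [[j] for j in range(r2.shape[0])]
    while len(blocks) > n_blocks:
        sims = [r2[np.ix_(blocks[i], blocks[i + 1])].mean()
                for i in range(len(blocks) - 1)]
        i = int(np.argmax(sims))
        blocks[i:i + 2] = [blocks[i] + blocks[i + 1]]
    return blocks

def group_lasso(X, y, blocks, lam, n_iter=500):
    """Proximal gradient descent for
    (1/2n)||y - Xb||^2 + lam * sum_g sqrt(|g|) * ||b_g||_2."""
    n, p = X.shape
    step = n / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the gradient
    beta = np.zeros(p)
    for _ in range(n_iter):
        z = beta - step * (X.T @ (X @ beta - y) / n)
        for g in blocks:
            norm = np.linalg.norm(z[g])
            thresh = step * lam * np.sqrt(len(g))
            # Block soft-thresholding: a whole block is zeroed or shrunk together.
            z[g] = 0.0 if norm <= thresh else z[g] * (1 - thresh / norm)
        beta = z
    return beta

# Toy data (hypothetical): three blocks of four correlated predictors,
# with every predictor of the middle block causal.
rng = np.random.default_rng(0)
n, block_sizes = 200, [4, 4, 4]
X = np.hstack([rng.normal(size=(n, 1)) + 0.3 * rng.normal(size=(n, s))
               for s in block_sizes])
X = (X - X.mean(0)) / X.std(0)
beta_true = np.zeros(12)
beta_true[4:8] = 1.0
y = X @ beta_true + 0.5 * rng.normal(size=n)
y = y - y.mean()

blocks = adjacent_blocks(ld_r2(X), n_blocks=3)       # steps 1 and 2 (simplified)
beta_hat = group_lasso(X, y, blocks, lam=0.5)        # step 3
selected = [g for g in blocks if np.any(beta_hat[g] != 0)]
print("blocks:", blocks)
print("selected blocks:", selected)
```

Because the penalty acts on whole blocks, predictors in the same inferred block enter or leave the model together, which is the property the description relies on when several causal SNPs share an LD block.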
