
A forest-based feature screening approach for large-scale genome data with complex structures

BACKGROUND: Genome-wide association studies (GWAS) interrogate the whole genome at large scale to characterize the complex genetic architecture of biomedical traits. When the number of SNPs increases dramatically to half a million while the sample size remains limited to thousands, the traditional p-value ba...


Bibliographic Details
Main Authors:	Wang, Gang; Fu, Guifang; Corcoran, Christopher
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2015
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4690313/ https://www.ncbi.nlm.nih.gov/pubmed/26698561 http://dx.doi.org/10.1186/s12863-015-0294-9

author	Wang, Gang Fu, Guifang Corcoran, Christopher
collection	PubMed
description	BACKGROUND: Genome-wide association studies (GWAS) interrogate the whole genome at large scale to characterize the complex genetic architecture of biomedical traits. When the number of SNPs increases dramatically to half a million while the sample size remains limited to thousands, the traditional p-value-based statistical approaches suffer from unprecedented limitations. Feature screening has proved to be an effective and powerful approach for handling ultrahigh-dimensional data statistically, yet it has not received much attention in GWAS. Feature screening reduces the feature space from millions to hundreds by removing non-informative noise. However, the univariate measures used to rank features are based mainly on individual effects and do not consider mutual interactions with other features. In this article, we explore the performance of a random forest (RF) based feature screening procedure to emphasize the SNPs that have complex effects on a continuous phenotype. RESULTS: Both simulation and real data analysis are conducted to examine the power of the forest-based feature screening. We compare it with five other popular feature screening approaches via simulation and conclude that RF can serve as a decent feature screening tool to accommodate complex genetic effects such as nonlinear, interactive, correlative, and joint effects. Unlike the traditional p-value-based Manhattan plot, we use the Permutation Variable Importance Measure (PVIM) to display the relative significance and believe that it will provide as much useful information as the traditional plot. CONCLUSION: Most complex traits are found to be regulated by epistatic and polygenic variants. The forest-based feature screening is proven to be an efficient, easily implemented, and accurate approach to cope with whole-genome data with complex structures. Our explorations should add to the growing body of work extending feature screening to better serve the demands of contemporary genome data.
format	Online Article Text
id	pubmed-4690313
institution	National Center for Biotechnology Information
language	English
publishDate	2015
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4690313 2015-12-25 BMC Genet Research Article BioMed Central 2015-12-23 /pmc/articles/PMC4690313/ /pubmed/26698561 http://dx.doi.org/10.1186/s12863-015-0294-9 Text en © Wang et al. 2015 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	A forest-based feature screening approach for large-scale genome data with complex structures
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4690313/ https://www.ncbi.nlm.nih.gov/pubmed/26698561 http://dx.doi.org/10.1186/s12863-015-0294-9
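
The abstract in this record describes ranking SNPs by a permutation variable importance measure (PVIM) from a random forest. The sketch below illustrates that general idea only and is not the authors' implementation: it grows a small forest of shallow regression trees in pure Python, then scores each SNP by how much the out-of-bag mean squared error rises when that SNP's column is shuffled. The function names (`fit_tree`, `forest_pvim`), the tuning values, and the simulated genotypes and phenotype are all assumptions made for illustration.

```python
import random


def sse(y, rows):
    """Sum of squared deviations of y over the given row indices."""
    m = sum(y[i] for i in rows) / len(rows)
    return sum((y[i] - m) ** 2 for i in rows)


def fit_tree(X, y, rows, depth, mtry, rng):
    """Grow a small regression tree; a leaf is a float, a node is a tuple."""
    mean = sum(y[i] for i in rows) / len(rows)
    if depth == 0 or len(rows) < 4:
        return mean
    best = None
    for j in rng.sample(range(len(X[0])), mtry):  # random feature subset
        # Skip the largest value so both children are non-empty.
        for t in sorted({X[i][j] for i in rows})[:-1]:
            left = [i for i in rows if X[i][j] <= t]
            right = [i for i in rows if X[i][j] > t]
            score = sse(y, left) + sse(y, right)
            if best is None or score < best[0]:
                best = (score, j, t, left, right)
    if best is None:  # every sampled feature was constant in this node
        return mean
    _, j, t, left, right = best
    return (j, t, fit_tree(X, y, left, depth - 1, mtry, rng),
            fit_tree(X, y, right, depth - 1, mtry, rng))


def predict(tree, x):
    """Walk from the root to a leaf and return the leaf mean."""
    while isinstance(tree, tuple):
        j, t, left, right = tree
        tree = left if x[j] <= t else right
    return tree


def mse(tree, X, y):
    return sum((predict(tree, x) - t) ** 2 for x, t in zip(X, y)) / len(y)


def forest_pvim(X, y, n_trees=100, depth=4, mtry=None, seed=0):
    """Average out-of-bag increase in MSE after permuting each feature."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    mtry = mtry or max(1, p // 3)
    imp, used = [0.0] * p, 0
    for _ in range(n_trees):
        boot = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        oob = sorted(set(range(n)) - set(boot))       # out-of-bag rows
        if not oob:
            continue
        tree = fit_tree(X, y, boot, depth, mtry, rng)
        y_oob = [y[i] for i in oob]
        base = mse(tree, [X[i] for i in oob], y_oob)
        for j in range(p):
            col = [X[i][j] for i in oob]
            rng.shuffle(col)                          # break the link to y
            X_perm = [X[i][:j] + [v] + X[i][j + 1:] for i, v in zip(oob, col)]
            imp[j] += mse(tree, X_perm, y_oob) - base
        used += 1
    return [v / used for v in imp]


# Simulated data (hypothetical): 10 SNPs coded 0/1/2 for 300 individuals.
# SNP 0 has a main effect; SNPs 1 and 2 act through an interaction.
data_rng = random.Random(1)
X = [[data_rng.randrange(3) for _ in range(10)] for _ in range(300)]
y = [2.0 * x[0] + 1.5 * x[1] * x[2] + data_rng.gauss(0, 0.5) for x in X]

importance = forest_pvim(X, y)
ranking = sorted(range(10), key=lambda j: importance[j], reverse=True)
print("SNPs ranked by PVIM:", ranking)
```

Screening would then keep only the top-ranked SNPs for downstream modelling. At the scale the abstract mentions (hundreds of thousands of SNPs), an optimized random forest library would be needed in place of this toy forest.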
