Genome-wide association data classification and SNPs selection using two-stage quality-based Random Forests


Bibliographic Details
Main authors:	Nguyen, Thanh-Tung; Huang, Joshua Zhexue; Wu, Qingyao; Nguyen, Thuy Thi; Li, Mark Junjie
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2015
Subjects:	Proceedings
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4331719/ https://www.ncbi.nlm.nih.gov/pubmed/25708662 http://dx.doi.org/10.1186/1471-2164-16-S2-S5

_version_	1782357766184108032
author	Nguyen, Thanh-Tung Huang, Joshua Zhexue Wu, Qingyao Nguyen, Thuy Thi Li, Mark Junjie
author_facet	Nguyen, Thanh-Tung Huang, Joshua Zhexue Wu, Qingyao Nguyen, Thuy Thi Li, Mark Junjie
author_sort	Nguyen, Thanh-Tung
collection	PubMed
description	BACKGROUND: Selection and identification of single-nucleotide polymorphisms (SNPs) are the most important tasks in genome-wide association data analysis. The problem is difficult because genome-wide association data are very high-dimensional and a large portion of the SNPs in the data are irrelevant to the disease. Advanced machine learning methods have been used successfully in genome-wide association studies (GWAS) to identify genetic variants that have relatively large effects in some common, complex diseases. Among them, the most successful is Random Forests (RF). Despite performing well in terms of prediction accuracy on some moderately sized data sets, RF still struggles in GWAS to select informative SNPs and build accurate prediction models. In this paper, we propose a new two-stage quality-based sampling method for random forests, named ts-RF, for SNP subspace selection in GWAS. The method first applies p-value assessment to find a cut-off point that separates informative and irrelevant SNPs into two groups. The informative SNP group is further divided into two sub-groups: highly informative and weakly informative SNPs. When sampling the SNP subspace for building the trees of the forest, only SNPs from these two sub-groups are taken into account. The feature subspaces used to split a tree node always contain highly informative SNPs. RESULTS: This approach makes it possible to generate more accurate trees with a lower prediction error while possibly avoiding overfitting. It allows one to detect interactions of multiple SNPs with the diseases, and to reduce the dimensionality and the amount of genome-wide association data needed for learning the RF model. Extensive experiments on two genome-wide SNP data sets (Parkinson case-control data comprising 408,803 SNPs and Alzheimer case-control data comprising 380,157 SNPs) and 10 gene data sets demonstrated that the proposed model significantly reduced prediction errors and outperformed most existing state-of-the-art random forests. The top 25 SNPs in the Parkinson data set identified by the proposed model include four interesting genes associated with neurological disorders. CONCLUSION: The presented approach has been shown to be effective in selecting informative sub-groups of SNPs potentially associated with diseases, which traditional statistical approaches might fail to detect. The new RF works well for data in which the number of case-control objects is much smaller than the number of SNPs, which is a typical problem in gene data and GWAS. Experimental results demonstrated the effectiveness of the proposed RF model, which outperformed state-of-the-art RFs, including Breiman's RF, GRRF and wsRF.
format	Online Article Text
id	pubmed-4331719
institution	National Center for Biotechnology Information
language	English
publishDate	2015
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4331719 2015-03-19 Genome-wide association data classification and SNPs selection using two-stage quality-based Random Forests Nguyen, Thanh-Tung Huang, Joshua Zhexue Wu, Qingyao Nguyen, Thuy Thi Li, Mark Junjie BMC Genomics Proceedings BioMed Central 2015-01-21 /pmc/articles/PMC4331719/ /pubmed/25708662 http://dx.doi.org/10.1186/1471-2164-16-S2-S5 Text en Copyright © 2015 Nguyen et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle	Proceedings Nguyen, Thanh-Tung Huang, Joshua Zhexue Wu, Qingyao Nguyen, Thuy Thi Li, Mark Junjie Genome-wide association data classification and SNPs selection using two-stage quality-based Random Forests
title	Genome-wide association data classification and SNPs selection using two-stage quality-based Random Forests
title_full	Genome-wide association data classification and SNPs selection using two-stage quality-based Random Forests
title_fullStr	Genome-wide association data classification and SNPs selection using two-stage quality-based Random Forests
title_full_unstemmed	Genome-wide association data classification and SNPs selection using two-stage quality-based Random Forests
title_short	Genome-wide association data classification and SNPs selection using two-stage quality-based Random Forests
title_sort	genome-wide association data classification and snps selection using two-stage quality-based random forests
topic	Proceedings
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4331719/ https://www.ncbi.nlm.nih.gov/pubmed/25708662 http://dx.doi.org/10.1186/1471-2164-16-S2-S5
work_keys_str_mv	AT nguyenthanhtung genomewideassociationdataclassificationandsnpsselectionusingtwostagequalitybasedrandomforests AT huangjoshuazhexue genomewideassociationdataclassificationandsnpsselectionusingtwostagequalitybasedrandomforests AT wuqingyao genomewideassociationdataclassificationandsnpsselectionusingtwostagequalitybasedrandomforests AT nguyenthuythi genomewideassociationdataclassificationandsnpsselectionusingtwostagequalitybasedrandomforests AT limarkjunjie genomewideassociationdataclassificationandsnpsselectionusingtwostagequalitybasedrandomforests
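
The two-stage sampling idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' ts-RF implementation. It assumes per-SNP p-values have already been computed by some test, and the thresholds `cutoff` and `strong_cutoff`, the parameter `n_strong`, and the function names `partition_snps` and `sample_subspace` are all hypothetical choices made for illustration. The abstract states only that the subspace always contains highly informative SNPs; how the rest of the subspace is filled is an assumption here.

```python
import random


def partition_snps(p_values, cutoff=0.05, strong_cutoff=0.001):
    """Split SNP indices into highly and weakly informative groups.

    SNPs with p >= cutoff are treated as irrelevant and dropped.
    Both thresholds are illustrative assumptions, not values from the paper.
    """
    strong = [i for i, p in enumerate(p_values) if p < strong_cutoff]
    weak = [i for i, p in enumerate(p_values) if strong_cutoff <= p < cutoff]
    return strong, weak


def sample_subspace(strong, weak, mtry, n_strong=1, rng=None):
    """Draw a feature subspace of at most mtry SNPs for one node split.

    Up to n_strong SNPs come from the highly informative group, so the
    subspace contains a highly informative SNP whenever one exists.
    The remainder is drawn from the weakly informative group (assumption).
    """
    if rng is None:
        rng = random.Random()
    k_strong = min(n_strong, len(strong), mtry)
    k_weak = min(mtry - k_strong, len(weak))
    return rng.sample(strong, k_strong) + rng.sample(weak, k_weak)


# Hypothetical p-values for eight SNPs.
p_values = [0.0002, 0.03, 0.4, 0.0005, 0.2, 0.01, 0.9, 0.04]
strong, weak = partition_snps(p_values)
subspace = sample_subspace(strong, weak, mtry=3, rng=random.Random(0))
print(strong, weak, subspace)
```

In a full random forest, a subspace like this would be redrawn at every node, and the best split would be searched only among the sampled SNPs; irrelevant SNPs never enter a subspace, which is where the dimensionality reduction comes from.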
