Random forests on Hadoop for genome-wide association studies of multivariate neuroimaging phenotypes
MOTIVATION: Multivariate quantitative traits arise naturally in recent neuroimaging genetics studies, in which both structural and functional variability of the human brain is measured non-invasively through techniques such as magnetic resonance imaging (MRI). There is growing interest in detecting...
Main authors: | Wang, Yue; Goh, Wilson; Wong, Limsoon; Montana, Giovanni |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2013 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3853073/ https://www.ncbi.nlm.nih.gov/pubmed/24564704 http://dx.doi.org/10.1186/1471-2105-14-S16-S6 |
_version_ | 1782478777923665920 |
---|---|
author | Wang, Yue; Goh, Wilson; Wong, Limsoon; Montana, Giovanni |
author_facet | Wang, Yue; Goh, Wilson; Wong, Limsoon; Montana, Giovanni |
author_sort | Wang, Yue |
collection | PubMed |
description | MOTIVATION: Multivariate quantitative traits arise naturally in recent neuroimaging genetics studies, in which both structural and functional variability of the human brain is measured non-invasively through techniques such as magnetic resonance imaging (MRI). There is growing interest in detecting genetic variants associated with such multivariate traits, especially in genome-wide studies. Random forests (RFs) classifiers, which are ensembles of decision trees, are amongst the best performing machine learning algorithms and have been successfully employed for the prioritisation of genetic variants in case-control studies. RFs can also be applied to produce gene rankings in association studies with multivariate quantitative traits, and to estimate genetic similarities measures that are predictive of the trait. However, in studies involving hundreds of thousands of SNPs and high-dimensional traits, a very large ensemble of trees must be inferred from the data in order to obtain reliable rankings, which makes the application of these algorithms computationally prohibitive. RESULTS: We have developed a parallel version of the RF algorithm for regression and genetic similarity learning tasks in large-scale population genetic association studies involving multivariate traits, called PaRFR (Parallel Random Forest Regression). Our implementation takes advantage of the MapReduce programming model and is deployed on Hadoop, an open-source software framework that supports data-intensive distributed applications. Notable speed-ups are obtained by introducing a distance-based criterion for node splitting in the tree estimation process. PaRFR has been applied to a genome-wide association study on Alzheimer's disease (AD) in which the quantitative trait consists of a high-dimensional neuroimaging phenotype describing longitudinal changes in the human brain structure. PaRFR provides a ranking of SNPs associated to this trait, and produces pair-wise measures of genetic proximity that can be directly compared to pair-wise measures of phenotypic proximity. Several known AD-related variants have been identified, including APOE4 and TOMM40. We also present experimental evidence supporting the hypothesis of a linear relationship between the number of top-ranked mutated states, or frequent mutation patterns, and an indicator of disease severity. AVAILABILITY: The Java codes are freely available at http://www2.imperial.ac.uk/~gmontana. |
format | Online Article Text |
id | pubmed-3853073 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2013 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-38530732013-12-16 Random forests on Hadoop for genome-wide association studies of multivariate neuroimaging phenotypes Wang, Yue Goh, Wilson Wong, Limsoon Montana, Giovanni BMC Bioinformatics Research MOTIVATION: Multivariate quantitative traits arise naturally in recent neuroimaging genetics studies, in which both structural and functional variability of the human brain is measured non-invasively through techniques such as magnetic resonance imaging (MRI). There is growing interest in detecting genetic variants associated with such multivariate traits, especially in genome-wide studies. Random forests (RFs) classifiers, which are ensembles of decision trees, are amongst the best performing machine learning algorithms and have been successfully employed for the prioritisation of genetic variants in case-control studies. RFs can also be applied to produce gene rankings in association studies with multivariate quantitative traits, and to estimate genetic similarities measures that are predictive of the trait. However, in studies involving hundreds of thousands of SNPs and high-dimensional traits, a very large ensemble of trees must be inferred from the data in order to obtain reliable rankings, which makes the application of these algorithms computationally prohibitive. RESULTS: We have developed a parallel version of the RF algorithm for regression and genetic similarity learning tasks in large-scale population genetic association studies involving multivariate traits, called PaRFR (Parallel Random Forest Regression). Our implementation takes advantage of the MapReduce programming model and is deployed on Hadoop, an open-source software framework that supports data-intensive distributed applications. Notable speed-ups are obtained by introducing a distance-based criterion for node splitting in the tree estimation process. PaRFR has been applied to a genome-wide association study on Alzheimer's disease (AD) in which the quantitative trait consists of a high-dimensional neuroimaging phenotype describing longitudinal changes in the human brain structure. PaRFR provides a ranking of SNPs associated to this trait, and produces pair-wise measures of genetic proximity that can be directly compared to pair-wise measures of phenotypic proximity. Several known AD-related variants have been identified, including APOE4 and TOMM40. We also present experimental evidence supporting the hypothesis of a linear relationship between the number of top-ranked mutated states, or frequent mutation patterns, and an indicator of disease severity. AVAILABILITY: The Java codes are freely available at http://www2.imperial.ac.uk/~gmontana. BioMed Central 2013-10-22 /pmc/articles/PMC3853073/ /pubmed/24564704 http://dx.doi.org/10.1186/1471-2105-14-S16-S6 Text en Copyright © 2013 Wang et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Research Wang, Yue Goh, Wilson Wong, Limsoon Montana, Giovanni Random forests on Hadoop for genome-wide association studies of multivariate neuroimaging phenotypes |
title | Random forests on Hadoop for genome-wide association studies of multivariate neuroimaging phenotypes |
title_full | Random forests on Hadoop for genome-wide association studies of multivariate neuroimaging phenotypes |
title_fullStr | Random forests on Hadoop for genome-wide association studies of multivariate neuroimaging phenotypes |
title_full_unstemmed | Random forests on Hadoop for genome-wide association studies of multivariate neuroimaging phenotypes |
title_short | Random forests on Hadoop for genome-wide association studies of multivariate neuroimaging phenotypes |
title_sort | random forests on hadoop for genome-wide association studies of multivariate neuroimaging phenotypes |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3853073/ https://www.ncbi.nlm.nih.gov/pubmed/24564704 http://dx.doi.org/10.1186/1471-2105-14-S16-S6 |
work_keys_str_mv | AT wangyue randomforestsonhadoopforgenomewideassociationstudiesofmultivariateneuroimagingphenotypes AT gohwilson randomforestsonhadoopforgenomewideassociationstudiesofmultivariateneuroimagingphenotypes AT wonglimsoon randomforestsonhadoopforgenomewideassociationstudiesofmultivariateneuroimagingphenotypes AT montanagiovanni randomforestsonhadoopforgenomewideassociationstudiesofmultivariateneuroimagingphenotypes AT randomforestsonhadoopforgenomewideassociationstudiesofmultivariateneuroimagingphenotypes |
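
The description above mentions a distance-based criterion for node splitting but does not spell it out. The following is only an illustrative sketch of one way such a criterion can work, not the authors' PaRFR code (which is in Java and runs on Hadoop). It relies on a standard identity: the within-node sum of squares of a multivariate trait equals the sum of all squared pairwise distances in the node divided by twice the node size, so a split can be scored from a precomputed distance matrix without revisiting the high-dimensional trait vectors. The function names `squared_distances`, `within_node_ss` and `best_split`, and the 0/1/2 genotype coding, are assumptions made for this example.

```python
import numpy as np

def squared_distances(Y):
    """Squared Euclidean distances between all rows of the trait matrix Y."""
    diff = Y[:, None, :] - Y[None, :, :]
    return (diff ** 2).sum(axis=2)

def within_node_ss(D2, idx):
    """Within-node sum of squares computed only from squared pairwise distances.

    Uses the identity sum_i ||y_i - mean||^2 = (1 / (2n)) * sum_{i,j} d_ij^2.
    """
    idx = np.asarray(idx)
    if idx.size == 0:
        return 0.0
    sub = D2[np.ix_(idx, idx)]
    return sub.sum() / (2.0 * idx.size)

def best_split(genotypes, D2):
    """Score the two possible thresholds on one SNP (assumed coded 0/1/2).

    Returns (threshold, cost), where cost is the summed within-node sum of
    squares of the two children; (None, inf) if no valid split exists.
    """
    genotypes = np.asarray(genotypes)
    best = (None, np.inf)
    for t in (0, 1):
        left = np.flatnonzero(genotypes <= t)
        right = np.flatnonzero(genotypes > t)
        if left.size == 0 or right.size == 0:
            continue  # a split must send samples to both children
        cost = within_node_ss(D2, left) + within_node_ss(D2, right)
        if cost < best[1]:
            best = (t, cost)
    return best

# Toy data: five subjects, a two-dimensional trait, one SNP.
Y = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
snp = np.array([0, 0, 1, 1, 2])
D2 = squared_distances(Y)
threshold, cost = best_split(snp, D2)
```

In this sketch the distance matrix is computed once, so the cost of scoring a candidate split depends on the number of samples in the node rather than on the dimension of the phenotype. Whether PaRFR obtains its reported speed-ups in exactly this way is not stated in the record.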