VariantSpark: Cloud-based machine learning for association study of complex phenotype and large-scale genomic data
BACKGROUND: Many traits and diseases are thought to be driven by >1 gene (polygenic). Polygenic risk scores (PRS) hence expand on genome-wide association studies by taking multiple genes into account when risk models are built. However, PRS only considers the additive effect of individual genes b...
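Main authors: | Bayat, Arash; Szul, Piotr; O’Brien, Aidan R; Dunne, Robert; Hosking, Brendan; Jain, Yatish; Hosking, Cameron; Luo, Oscar J; Twine, Natalie; Bauer, Denis C |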
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2020 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7407261/ https://www.ncbi.nlm.nih.gov/pubmed/32761098 http://dx.doi.org/10.1093/gigascience/giaa077 |
_version_ | 1783567586640789504 |
---|---|
author | Bayat, Arash Szul, Piotr O’Brien, Aidan R Dunne, Robert Hosking, Brendan Jain, Yatish Hosking, Cameron Luo, Oscar J Twine, Natalie Bauer, Denis C |
author_sort | Bayat, Arash |
collection | PubMed |
description | BACKGROUND: Many traits and diseases are thought to be driven by >1 gene (polygenic). Polygenic risk scores (PRS) hence expand on genome-wide association studies by taking multiple genes into account when risk models are built. However, PRS only considers the additive effect of individual genes but not epistatic interactions or the combination of individual and interacting drivers. While evidence of epistatic interactions is found in small datasets, large datasets have not been processed yet owing to the high computational complexity of the search for epistatic interactions. FINDINGS: We have developed VariantSpark, a distributed machine learning framework able to perform association analysis for complex phenotypes that are polygenic and potentially involve a large number of epistatic interactions. Efficient multi-layer parallelization allows VariantSpark to scale to the whole genome of population-scale datasets with 100,000,000 genomic variants and 100,000 samples. CONCLUSIONS: Compared with traditional monogenic genome-wide association studies, VariantSpark better identifies genomic variants associated with complex phenotypes. VariantSpark is 3.6 times faster than ReForeSt and the only method able to scale to ultra-high-dimensional genomic data in a manageable time. |
format | Online Article Text |
id | pubmed-7407261 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-7407261 2020-08-10 VariantSpark: Cloud-based machine learning for association study of complex phenotype and large-scale genomic data Bayat, Arash Szul, Piotr O’Brien, Aidan R Dunne, Robert Hosking, Brendan Jain, Yatish Hosking, Cameron Luo, Oscar J Twine, Natalie Bauer, Denis C Gigascience Technical Note BACKGROUND: Many traits and diseases are thought to be driven by >1 gene (polygenic). Polygenic risk scores (PRS) hence expand on genome-wide association studies by taking multiple genes into account when risk models are built. However, PRS only considers the additive effect of individual genes but not epistatic interactions or the combination of individual and interacting drivers. While evidence of epistatic interactions is found in small datasets, large datasets have not been processed yet owing to the high computational complexity of the search for epistatic interactions. FINDINGS: We have developed VariantSpark, a distributed machine learning framework able to perform association analysis for complex phenotypes that are polygenic and potentially involve a large number of epistatic interactions. Efficient multi-layer parallelization allows VariantSpark to scale to the whole genome of population-scale datasets with 100,000,000 genomic variants and 100,000 samples. CONCLUSIONS: Compared with traditional monogenic genome-wide association studies, VariantSpark better identifies genomic variants associated with complex phenotypes. VariantSpark is 3.6 times faster than ReForeSt and the only method able to scale to ultra-high-dimensional genomic data in a manageable time. Oxford University Press 2020-08-06 /pmc/articles/PMC7407261/ /pubmed/32761098 http://dx.doi.org/10.1093/gigascience/giaa077 Text en © The Author(s) 2020. Published by Oxford University Press.
http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
title | VariantSpark: Cloud-based machine learning for association study of complex phenotype and large-scale genomic data |
topic | Technical Note |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7407261/ https://www.ncbi.nlm.nih.gov/pubmed/32761098 http://dx.doi.org/10.1093/gigascience/giaa077 |