VariantSpark: Cloud-based machine learning for association study of complex phenotype and large-scale genomic data

BACKGROUND: Many traits and diseases are thought to be driven by >1 gene (polygenic). Polygenic risk scores (PRS) hence expand on genome-wide association studies by taking multiple genes into account when risk models are built. However, PRS consider only the additive effect of individual genes b...

Bibliographic Details
Main Authors: Bayat, Arash, Szul, Piotr, O’Brien, Aidan R, Dunne, Robert, Hosking, Brendan, Jain, Yatish, Hosking, Cameron, Luo, Oscar J, Twine, Natalie, Bauer, Denis C
Format: Online Article Text
Language: English
Published: Oxford University Press 2020
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7407261/
https://www.ncbi.nlm.nih.gov/pubmed/32761098
http://dx.doi.org/10.1093/gigascience/giaa077
author Bayat, Arash
Szul, Piotr
O’Brien, Aidan R
Dunne, Robert
Hosking, Brendan
Jain, Yatish
Hosking, Cameron
Luo, Oscar J
Twine, Natalie
Bauer, Denis C
author_sort Bayat, Arash
collection PubMed
description BACKGROUND: Many traits and diseases are thought to be driven by >1 gene (polygenic). Polygenic risk scores (PRS) hence expand on genome-wide association studies by taking multiple genes into account when risk models are built. However, PRS consider only the additive effect of individual genes, not epistatic interactions or the combination of individual and interacting drivers. While evidence of epistatic interactions is found in small datasets, large datasets have not been processed yet owing to the high computational complexity of the search for epistatic interactions. FINDINGS: We have developed VariantSpark, a distributed machine learning framework able to perform association analysis for complex phenotypes that are polygenic and potentially involve a large number of epistatic interactions. Efficient multi-layer parallelization allows VariantSpark to scale to the whole genome of population-scale datasets with 100,000,000 genomic variants and 100,000 samples. CONCLUSIONS: Compared with traditional monogenic genome-wide association studies, VariantSpark better identifies genomic variants associated with complex phenotypes. VariantSpark is 3.6 times faster than ReForeSt and the only method able to scale to ultra-high-dimensional genomic data in a manageable time.
format Online
Article
Text
id pubmed-7407261
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-7407261 2020-08-10
Gigascience
Technical Note
Oxford University Press 2020-08-06
/pmc/articles/PMC7407261/ /pubmed/32761098 http://dx.doi.org/10.1093/gigascience/giaa077
Text en © The Author(s) 2020. Published by Oxford University Press.
http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title VariantSpark: Cloud-based machine learning for association study of complex phenotype and large-scale genomic data
topic Technical Note
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7407261/
https://www.ncbi.nlm.nih.gov/pubmed/32761098
http://dx.doi.org/10.1093/gigascience/giaa077