
A hybrid computational strategy to address WGS variant analysis in >5000 samples


Bibliographic Details
Main Authors: Huang, Zhuoyi, Rustagi, Navin, Veeraraghavan, Narayanan, Carroll, Andrew, Gibbs, Richard, Boerwinkle, Eric, Venkata, Manjunath Gorentla, Yu, Fuli
Format: Online Article Text
Language: English
Published: BioMed Central 2016
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5018196/
https://www.ncbi.nlm.nih.gov/pubmed/27612449
http://dx.doi.org/10.1186/s12859-016-1211-6
_version_ 1782452876226854912
author Huang, Zhuoyi
Rustagi, Navin
Veeraraghavan, Narayanan
Carroll, Andrew
Gibbs, Richard
Boerwinkle, Eric
Venkata, Manjunath Gorentla
Yu, Fuli
author_facet Huang, Zhuoyi
Rustagi, Navin
Veeraraghavan, Narayanan
Carroll, Andrew
Gibbs, Richard
Boerwinkle, Eric
Venkata, Manjunath Gorentla
Yu, Fuli
author_sort Huang, Zhuoyi
collection PubMed
description BACKGROUND: The decreasing costs of sequencing are driving the need for cost-effective and real-time variant calling of whole genome sequencing data. The scale of these projects is far beyond the capacity of the typical computing resources available to most research labs. Other infrastructures, such as the AWS cloud environment and supercomputers, also have limitations that make large-scale joint variant calling infeasible, and infrastructure-specific variant calling strategies either fail to scale to large datasets or abandon joint calling altogether. RESULTS: We present a high-throughput framework that includes multiple variant callers for single nucleotide variant (SNV) calling and leverages a hybrid computing infrastructure consisting of the AWS cloud, supercomputers and local high-performance computing clusters. We present a novel binning approach for large-scale joint variant calling and imputation that can scale to over 10,000 samples while producing SNV callsets with high sensitivity and specificity. As a proof of principle, we present results of analysis of the Cohorts for Heart And Aging Research in Genomic Epidemiology (CHARGE) WGS freeze 3 dataset, in which joint calling, imputation and phasing of over 5300 whole-genome samples were completed in under 6 weeks using four state-of-the-art callers: SNPTools, GATK-HaplotypeCaller, GATK-UnifiedGenotyper and GotCloud. We used Amazon AWS, a 4000-core in-house cluster at Baylor College of Medicine, the IBM Power PC Blue BioU at Rice and Rhea at Oak Ridge National Laboratory (ORNL) for the computation. AWS was used for joint calling of 180 TB of BAM files, and the ORNL and Rice supercomputers were used for the imputation and phasing step. All other steps were carried out on the local compute cluster. The entire operation used 5.2 million core hours and transferred only 6 TB of data in total across the platforms. CONCLUSIONS: Even with the increasing size of whole genome datasets, ensemble joint calling of SNVs for low-coverage data can be accomplished in a scalable, cost-effective and fast manner on heterogeneous computing platforms without compromising variant quality. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-016-1211-6) contains supplementary material, which is available to authorized users.
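To make the binning and platform-routing idea in the abstract concrete, the following is a minimal, hypothetical Python sketch (not taken from the article): it splits a few chromosomes into fixed-width genomic bins and routes each pipeline stage to the kind of platform the abstract names (joint calling on AWS, imputation/phasing on supercomputers, remaining steps on a local cluster). The bin width, chromosome set, stage names and Job structure are illustrative assumptions, not the authors' implementation.

# Illustrative sketch only: the abstract describes a binning approach for large-scale
# joint calling on a hybrid AWS / supercomputer / local infrastructure, but gives no
# implementation details. Bin size, platform routing and the Job structure below are
# hypothetical choices made for this example.

from dataclasses import dataclass

# Approximate GRCh37 chromosome lengths (bp) for a few chromosomes, for illustration.
CHROM_LENGTHS = {"chr20": 63_025_520, "chr21": 48_129_895, "chr22": 51_304_566}

BIN_SIZE = 5_000_000  # hypothetical bin width; the paper's chosen value is not stated here

# Hypothetical routing of pipeline stages to platforms, mirroring the abstract:
# joint calling on AWS, imputation/phasing on supercomputers, the rest locally.
STAGE_PLATFORM = {
    "joint_calling": "AWS",
    "imputation_phasing": "supercomputer (Rhea / Blue BioU)",
    "merge_and_qc": "local cluster",
}

@dataclass
class Job:
    stage: str
    platform: str
    region: str  # genomic bin, e.g. "chr20:1-5000000"

def make_bins(chrom: str, length: int, bin_size: int = BIN_SIZE):
    """Split one chromosome into fixed-width, non-overlapping bins."""
    return [f"{chrom}:{start + 1}-{min(start + bin_size, length)}"
            for start in range(0, length, bin_size)]

def build_jobs():
    """Create one job per (stage, bin), routed to its platform."""
    jobs = []
    for chrom, length in CHROM_LENGTHS.items():
        for region in make_bins(chrom, length):
            for stage, platform in STAGE_PLATFORM.items():
                jobs.append(Job(stage, platform, region))
    return jobs

if __name__ == "__main__":
    jobs = build_jobs()
    print(f"{len(jobs)} jobs over {len(CHROM_LENGTHS)} chromosomes")
    print(jobs[0])

The point of region binning, as the abstract implies, is that each bin's job can run independently on whichever platform suits that stage, so only bin-sized intermediate results need to move between platforms rather than the full 180 TB of BAM files.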
format Online
Article
Text
id pubmed-5018196
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-5018196 2016-09-19 A hybrid computational strategy to address WGS variant analysis in >5000 samples Huang, Zhuoyi Rustagi, Navin Veeraraghavan, Narayanan Carroll, Andrew Gibbs, Richard Boerwinkle, Eric Venkata, Manjunath Gorentla Yu, Fuli BMC Bioinformatics Methodology Article BioMed Central 2016-09-10 /pmc/articles/PMC5018196/ /pubmed/27612449 http://dx.doi.org/10.1186/s12859-016-1211-6 Text en © The Author(s). 2016 Open Access: This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Methodology Article
Huang, Zhuoyi
Rustagi, Navin
Veeraraghavan, Narayanan
Carroll, Andrew
Gibbs, Richard
Boerwinkle, Eric
Venkata, Manjunath Gorentla
Yu, Fuli
A hybrid computational strategy to address WGS variant analysis in >5000 samples
title A hybrid computational strategy to address WGS variant analysis in >5000 samples
title_full A hybrid computational strategy to address WGS variant analysis in >5000 samples
title_fullStr A hybrid computational strategy to address WGS variant analysis in >5000 samples
title_full_unstemmed A hybrid computational strategy to address WGS variant analysis in >5000 samples
title_short A hybrid computational strategy to address WGS variant analysis in >5000 samples
title_sort hybrid computational strategy to address wgs variant analysis in >5000 samples
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5018196/
https://www.ncbi.nlm.nih.gov/pubmed/27612449
http://dx.doi.org/10.1186/s12859-016-1211-6
work_keys_str_mv AT huangzhuoyi ahybridcomputationalstrategytoaddresswgsvariantanalysisin5000samples
AT rustaginavin ahybridcomputationalstrategytoaddresswgsvariantanalysisin5000samples
AT veeraraghavannarayanan ahybridcomputationalstrategytoaddresswgsvariantanalysisin5000samples
AT carrollandrew ahybridcomputationalstrategytoaddresswgsvariantanalysisin5000samples
AT gibbsrichard ahybridcomputationalstrategytoaddresswgsvariantanalysisin5000samples
AT boerwinkleeric ahybridcomputationalstrategytoaddresswgsvariantanalysisin5000samples
AT venkatamanjunathgorentla ahybridcomputationalstrategytoaddresswgsvariantanalysisin5000samples
AT yufuli ahybridcomputationalstrategytoaddresswgsvariantanalysisin5000samples
AT huangzhuoyi hybridcomputationalstrategytoaddresswgsvariantanalysisin5000samples
AT rustaginavin hybridcomputationalstrategytoaddresswgsvariantanalysisin5000samples
AT veeraraghavannarayanan hybridcomputationalstrategytoaddresswgsvariantanalysisin5000samples
AT carrollandrew hybridcomputationalstrategytoaddresswgsvariantanalysisin5000samples
AT gibbsrichard hybridcomputationalstrategytoaddresswgsvariantanalysisin5000samples
AT boerwinkleeric hybridcomputationalstrategytoaddresswgsvariantanalysisin5000samples
AT venkatamanjunathgorentla hybridcomputationalstrategytoaddresswgsvariantanalysisin5000samples
AT yufuli hybridcomputationalstrategytoaddresswgsvariantanalysisin5000samples