Rainbow: a tool for large-scale whole-genome sequencing data analysis using cloud computing

BACKGROUND: Technical improvements have decreased sequencing costs and, as a result, the size and number of genomic datasets have increased rapidly. Because of the lower cost, large amounts of sequence data are now being produced by small to midsize research groups. Crossbow is a software tool that...

Bibliographic Details
Main authors: Zhao, Shanrong; Prenger, Kurt; Smith, Lance; Messina, Thomas; Fan, Hongtao; Jaeger, Edward; Stephens, Susan
Format: Online Article Text
Language: English
Published: BioMed Central 2013
Subjects: Software
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3698007/
https://www.ncbi.nlm.nih.gov/pubmed/23802613
http://dx.doi.org/10.1186/1471-2164-14-425
author Zhao, Shanrong
Prenger, Kurt
Smith, Lance
Messina, Thomas
Fan, Hongtao
Jaeger, Edward
Stephens, Susan
collection PubMed
description BACKGROUND: Technical improvements have decreased sequencing costs and, as a result, the size and number of genomic datasets have increased rapidly. Because of the lower cost, large amounts of sequence data are now being produced by small to midsize research groups. Crossbow is a software tool that can detect single nucleotide polymorphisms (SNPs) in whole-genome sequencing (WGS) data from a single subject; however, Crossbow has a number of limitations when applied to multiple subjects from large-scale WGS projects. The data storage and CPU resources required for large-scale WGS data analyses are too large for many core facilities and individual laboratories to provide. To help meet these challenges, we have developed Rainbow, a cloud-based software package that can assist in the automation of large-scale WGS data analyses. RESULTS: Here, we evaluated the performance of Rainbow by analyzing 44 different whole-genome-sequenced subjects. Rainbow has the capacity to process genomic data from more than 500 subjects in two weeks using cloud computing provided by Amazon Web Services. This time includes the import and export of the data using the Amazon Import/Export service. The average cost of processing a single sample in the cloud was less than 120 US dollars. Compared with Crossbow, the main improvements incorporated into Rainbow include the ability: (1) to handle BAM as well as FASTQ input files; (2) to split large sequence files for better load balance downstream; (3) to log running metrics during data processing and to monitor multiple Amazon Elastic Compute Cloud (EC2) instances; and (4) to merge SOAPsnp outputs for multiple individuals into a single file to facilitate downstream genome-wide association studies. CONCLUSIONS: Rainbow is a scalable, cost-effective, and open-source tool for large-scale WGS data analysis. For human WGS data sequenced on either the Illumina HiSeq 2000 or HiSeq 2500 platform, Rainbow can be used straight out of the box. Rainbow is available for third-party implementation and use, and can be downloaded from http://s3.amazonaws.com/jnj_rainbow/index.html.
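The abstract mentions splitting large sequence files for better load balance downstream but does not say how Rainbow does it. As a rough illustration of the general idea only, the following Python sketch cuts a FASTQ stream into chunks of a fixed number of reads. It is not taken from Rainbow; the function name `split_fastq` and the parameter `reads_per_chunk` are hypothetical, and the sketch assumes the common four-lines-per-record FASTQ layout with no wrapped sequence lines.

```python
import io
from itertools import islice
from typing import Iterator, List, TextIO


def split_fastq(handle: TextIO, reads_per_chunk: int) -> Iterator[List[str]]:
    """Yield lists of FASTQ lines, each holding at most reads_per_chunk records.

    Hypothetical helper for illustration; assumes every record is exactly
    four lines (header, sequence, '+', qualities).
    """
    if reads_per_chunk < 1:
        raise ValueError("reads_per_chunk must be positive")
    lines_per_chunk = 4 * reads_per_chunk
    while True:
        # Read the next block of lines without loading the whole file.
        chunk = list(islice(handle, lines_per_chunk))
        if not chunk:
            return
        if len(chunk) % 4 != 0:
            raise ValueError("truncated FASTQ record")
        yield chunk


if __name__ == "__main__":
    # Five made-up reads, split into chunks of at most two reads each.
    demo = io.StringIO("".join(f"@read{i}\nACGT\n+\nIIII\n" for i in range(5)))
    sizes = [len(chunk) // 4 for chunk in split_fastq(demo, reads_per_chunk=2)]
    print(sizes)  # prints [2, 2, 1]
```

Chunks of roughly equal size can then be handed to separate workers, which is the load-balancing motivation the abstract gives; how chunk size should be chosen for a given cluster is outside what this record states.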
format Online
Article
Text
id pubmed-3698007
institution National Center for Biotechnology Information
language English
publishDate 2013
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3698007 2013-07-02 BMC Genomics Software BioMed Central 2013-06-27 /pmc/articles/PMC3698007/ /pubmed/23802613 http://dx.doi.org/10.1186/1471-2164-14-425 Text en Copyright © 2013 Zhao et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Rainbow: a tool for large-scale whole-genome sequencing data analysis using cloud computing
topic Software