Analyzing large scale genomic data on the cloud with Sparkhit

MOTIVATION: The increasing amount of next-generation sequencing data poses a fundamental challenge on large scale genomic analytics. Existing tools use different distributed computational platforms to scale-out bioinformatics workloads. However, the scalability of these tools is not efficient. Moreo...

Bibliographic Details
Main authors:	Huang, Liren; Krüger, Jan; Sczyrba, Alexander
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2018
Subjects:	Original Papers
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5925781/ https://www.ncbi.nlm.nih.gov/pubmed/29253074 http://dx.doi.org/10.1093/bioinformatics/btx808

collection	PubMed
description	MOTIVATION: The increasing amount of next-generation sequencing data poses a fundamental challenge on large scale genomic analytics. Existing tools use different distributed computational platforms to scale-out bioinformatics workloads. However, the scalability of these tools is not efficient. Moreover, they have heavy run time overheads when pre-processing large amounts of data. To address these limitations, we have developed Sparkhit: a distributed bioinformatics framework built on top of the Apache Spark platform. RESULTS: Sparkhit integrates a variety of analytical methods. It is implemented in the Spark extended MapReduce model. It runs 92–157 times faster than MetaSpark on metagenomic fragment recruitment and 18–32 times faster than Crossbow on data pre-processing. We analyzed 100 terabytes of data across four genomic projects in the cloud in 21 h, which includes the run times of cluster deployment and data downloading. Furthermore, our application on the entire Human Microbiome Project shotgun sequencing data was completed in 2 h, presenting an approach to easily associate large amounts of public datasets with reference data. AVAILABILITY AND IMPLEMENTATION: Sparkhit is freely available at: https://rhinempi.github.io/sparkhit/. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format	Online Article Text
id	pubmed-5925781
institution	National Center for Biotechnology Information
language	English
publishDate	2018
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-5925781 2018-05-04 Bioinformatics Original Papers Oxford University Press 2018-05-01 2017-12-15 /pmc/articles/PMC5925781/ /pubmed/29253074 http://dx.doi.org/10.1093/bioinformatics/btx808 Text en © The Author(s) 2017. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title	Analyzing large scale genomic data on the cloud with Sparkhit
topic	Original Papers
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5925781/ https://www.ncbi.nlm.nih.gov/pubmed/29253074 http://dx.doi.org/10.1093/bioinformatics/btx808