
BAMSI: a multi-cloud service for scalable distributed filtering of massive genome data

BACKGROUND: The advent of next-generation sequencing (NGS) has made whole-genome sequencing of cohorts of individuals a reality. Primary datasets of raw or aligned reads of this sort can get very large. For scientific questions where curated called variants are not sufficient, the sheer size of the...


Bibliographic Details
Main authors:	Ausmees, Kristiina; John, Aji; Toor, Salman Z.; Hellander, Andreas; Nettelblad, Carl
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2018
Subjects:	Software
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6019789/ https://www.ncbi.nlm.nih.gov/pubmed/29940842 http://dx.doi.org/10.1186/s12859-018-2241-z

author	Ausmees, Kristiina; John, Aji; Toor, Salman Z.; Hellander, Andreas; Nettelblad, Carl
author_sort	Ausmees, Kristiina
collection	PubMed
description	BACKGROUND: The advent of next-generation sequencing (NGS) has made whole-genome sequencing of cohorts of individuals a reality. Primary datasets of raw or aligned reads of this sort can get very large. For scientific questions where curated called variants are not sufficient, the sheer size of the datasets makes analysis prohibitively expensive. In order to make re-analysis of such data feasible without the need to have access to a large-scale computing facility, we have developed a highly scalable, storage-agnostic framework, an associated API and an easy-to-use web user interface to execute custom filters on large genomic datasets. RESULTS: We present BAMSI, a Software as a Service (SaaS) solution for filtering of the 1000 Genomes phase 3 set of aligned reads, with the possibility of extension and customization to other sets of files. Unique to our solution is the capability of simultaneously utilizing many different mirrors of the data to increase the speed of the analysis. In particular, if the data is available in private or public clouds – an increasingly common scenario for both academic and commercial cloud providers – our framework allows for seamless deployment of filtering workers close to data. We show results indicating that such a setup improves the horizontal scalability of the system, and present a possible use case of the framework by performing an analysis of structural variation in the 1000 Genomes data set. CONCLUSIONS: BAMSI constitutes a framework for efficient filtering of large genomic data sets that is flexible in the use of compute as well as storage resources. The data resulting from the filter is assumed to be greatly reduced in size, and can easily be downloaded or routed into e.g. a Hadoop cluster for subsequent interactive analysis using Hive, Spark or similar tools. In this respect, our framework also suggests a general model for making very large datasets of high scientific value more accessible by offering the possibility for organizations to share the cost of hosting data on hot storage, without compromising the scalability of downstream analysis. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12859-018-2241-z) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-6019789
institution	National Center for Biotechnology Information
language	English
publishDate	2018
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-6019789 2018-07-06 BMC Bioinformatics Software BioMed Central 2018-06-26 /pmc/articles/PMC6019789/ /pubmed/29940842 http://dx.doi.org/10.1186/s12859-018-2241-z Text en © The Author(s) 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	BAMSI: a multi-cloud service for scalable distributed filtering of massive genome data
topic	Software
