BAMSI: a multi-cloud service for scalable distributed filtering of massive genome data

BACKGROUND: The advent of next-generation sequencing (NGS) has made whole-genome sequencing of cohorts of individuals a reality. Primary datasets of raw or aligned reads of this sort can get very large. For scientific questions where curated called variants are not sufficient, the sheer size of the...

Bibliographic Details
Main Authors: Ausmees, Kristiina; John, Aji; Toor, Salman Z.; Hellander, Andreas; Nettelblad, Carl
Format: Online Article Text
Language: English
Published: BioMed Central 2018
Subjects: Software
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6019789/
https://www.ncbi.nlm.nih.gov/pubmed/29940842
http://dx.doi.org/10.1186/s12859-018-2241-z
author Ausmees, Kristiina
John, Aji
Toor, Salman Z.
Hellander, Andreas
Nettelblad, Carl
collection PubMed
description BACKGROUND: The advent of next-generation sequencing (NGS) has made whole-genome sequencing of cohorts of individuals a reality. Primary datasets of raw or aligned reads of this sort can get very large. For scientific questions where curated called variants are not sufficient, the sheer size of the datasets makes analysis prohibitively expensive. In order to make re-analysis of such data feasible without the need to have access to a large-scale computing facility, we have developed a highly scalable, storage-agnostic framework, an associated API and an easy-to-use web user interface to execute custom filters on large genomic datasets. RESULTS: We present BAMSI, a Software as a Service (SaaS) solution for filtering of the 1000 Genomes phase 3 set of aligned reads, with the possibility of extension and customization to other sets of files. Unique to our solution is the capability of simultaneously utilizing many different mirrors of the data to increase the speed of the analysis. In particular, if the data is available in private or public clouds – an increasingly common scenario for both academic and commercial cloud providers – our framework allows for seamless deployment of filtering workers close to data. We show results indicating that such a setup improves the horizontal scalability of the system, and present a possible use case of the framework by performing an analysis of structural variation in the 1000 Genomes data set. CONCLUSIONS: BAMSI constitutes a framework for efficient filtering of large genomic data sets that is flexible in the use of compute as well as storage resources. The data resulting from the filter is assumed to be greatly reduced in size, and can easily be downloaded or routed into e.g. a Hadoop cluster for subsequent interactive analysis using Hive, Spark or similar tools.
In this respect, our framework also suggests a general model for making very large datasets of high scientific value more accessible by offering the possibility for organizations to share the cost of hosting data on hot storage, without compromising the scalability of downstream analysis. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12859-018-2241-z) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-6019789
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-6019789 2018-07-06 BAMSI: a multi-cloud service for scalable distributed filtering of massive genome data Ausmees, Kristiina John, Aji Toor, Salman Z. Hellander, Andreas Nettelblad, Carl BMC Bioinformatics Software BACKGROUND: The advent of next-generation sequencing (NGS) has made whole-genome sequencing of cohorts of individuals a reality. Primary datasets of raw or aligned reads of this sort can get very large. For scientific questions where curated called variants are not sufficient, the sheer size of the datasets makes analysis prohibitively expensive. In order to make re-analysis of such data feasible without the need to have access to a large-scale computing facility, we have developed a highly scalable, storage-agnostic framework, an associated API and an easy-to-use web user interface to execute custom filters on large genomic datasets. RESULTS: We present BAMSI, a Software as a Service (SaaS) solution for filtering of the 1000 Genomes phase 3 set of aligned reads, with the possibility of extension and customization to other sets of files. Unique to our solution is the capability of simultaneously utilizing many different mirrors of the data to increase the speed of the analysis. In particular, if the data is available in private or public clouds – an increasingly common scenario for both academic and commercial cloud providers – our framework allows for seamless deployment of filtering workers close to data. We show results indicating that such a setup improves the horizontal scalability of the system, and present a possible use case of the framework by performing an analysis of structural variation in the 1000 Genomes data set. CONCLUSIONS: BAMSI constitutes a framework for efficient filtering of large genomic data sets that is flexible in the use of compute as well as storage resources. The data resulting from the filter is assumed to be greatly reduced in size, and can easily be downloaded or routed into e.g. a Hadoop cluster for subsequent interactive analysis using Hive, Spark or similar tools. In this respect, our framework also suggests a general model for making very large datasets of high scientific value more accessible by offering the possibility for organizations to share the cost of hosting data on hot storage, without compromising the scalability of downstream analysis. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12859-018-2241-z) contains supplementary material, which is available to authorized users. BioMed Central 2018-06-26 /pmc/articles/PMC6019789/ /pubmed/29940842 http://dx.doi.org/10.1186/s12859-018-2241-z Text en © The Author(s) 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title BAMSI: a multi-cloud service for scalable distributed filtering of massive genome data
topic Software
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6019789/
https://www.ncbi.nlm.nih.gov/pubmed/29940842
http://dx.doi.org/10.1186/s12859-018-2241-z