
Large scale microbiome profiling in the cloud

MOTIVATION: Bacterial metagenomics profiling for metagenomic whole genome sequencing (mWGS) usually starts by aligning sequencing reads to a collection of reference genomes. Current profiling tools are designed to work against a small representative collection of genomes, and do not scale very well to larg...


Bibliographic Details
Main authors:	Valdes, Camilo; Stebliankin, Vitalii; Narasimhan, Giri
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2019
Subjects:	Ismb/Eccb 2019 Conference Proceedings
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6612844/ https://www.ncbi.nlm.nih.gov/pubmed/31510682 http://dx.doi.org/10.1093/bioinformatics/btz356

_version_	1783432949890285568
author	Valdes, Camilo; Stebliankin, Vitalii; Narasimhan, Giri
author_facet	Valdes, Camilo; Stebliankin, Vitalii; Narasimhan, Giri
author_sort	Valdes, Camilo
collection	PubMed
description	MOTIVATION: Bacterial metagenomics profiling for metagenomic whole genome sequencing (mWGS) usually starts by aligning sequencing reads to a collection of reference genomes. Current profiling tools are designed to work against a small representative collection of genomes, and do not scale very well to larger reference genome collections. However, large reference genome collections are capable of providing a more complete and accurate profile of the bacterial population in a metagenomics dataset. In this paper, we discuss a scalable, efficient and affordable approach to this problem, bringing big data solutions within the reach of laboratories with modest resources. RESULTS: We developed Flint, a metagenomics profiling pipeline that is built on top of the Apache Spark framework, and is designed for fast real-time profiling of metagenomic samples against a large collection of reference genomes. Flint takes advantage of Spark’s built-in parallelism and streaming engine architecture to quickly map reads against a large (170 GB) reference collection of 43 552 bacterial genomes from Ensembl. Flint runs on Amazon’s Elastic MapReduce service, and is able to profile 1 million Illumina paired-end reads against over 40 K genomes on 64 machines in 67 s, an order of magnitude faster than the state of the art, while using a much larger reference collection. Streaming the sequencing reads allows this approach to sustain mapping rates of 55 million reads per hour, at an hourly cluster cost of $8.00 USD, while avoiding the necessity of storing large quantities of intermediate alignments. AVAILABILITY AND IMPLEMENTATION: Flint is open source software, available under the MIT License (MIT). Source code is available at https://github.com/camilo-v/flint. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format	Online Article Text
id	pubmed-6612844
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-6612844 2019-07-12 Large scale microbiome profiling in the cloud Valdes, Camilo; Stebliankin, Vitalii; Narasimhan, Giri Bioinformatics Ismb/Eccb 2019 Conference Proceedings MOTIVATION: Bacterial metagenomics profiling for metagenomic whole genome sequencing (mWGS) usually starts by aligning sequencing reads to a collection of reference genomes. Current profiling tools are designed to work against a small representative collection of genomes, and do not scale very well to larger reference genome collections. However, large reference genome collections are capable of providing a more complete and accurate profile of the bacterial population in a metagenomics dataset. In this paper, we discuss a scalable, efficient and affordable approach to this problem, bringing big data solutions within the reach of laboratories with modest resources. RESULTS: We developed Flint, a metagenomics profiling pipeline that is built on top of the Apache Spark framework, and is designed for fast real-time profiling of metagenomic samples against a large collection of reference genomes. Flint takes advantage of Spark’s built-in parallelism and streaming engine architecture to quickly map reads against a large (170 GB) reference collection of 43 552 bacterial genomes from Ensembl. Flint runs on Amazon’s Elastic MapReduce service, and is able to profile 1 million Illumina paired-end reads against over 40 K genomes on 64 machines in 67 s, an order of magnitude faster than the state of the art, while using a much larger reference collection. Streaming the sequencing reads allows this approach to sustain mapping rates of 55 million reads per hour, at an hourly cluster cost of $8.00 USD, while avoiding the necessity of storing large quantities of intermediate alignments. AVAILABILITY AND IMPLEMENTATION: Flint is open source software, available under the MIT License (MIT). Source code is available at https://github.com/camilo-v/flint. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. Oxford University Press 2019-07 2019-07-05 /pmc/articles/PMC6612844/ /pubmed/31510682 http://dx.doi.org/10.1093/bioinformatics/btz356 Text en © The Author(s) 2019. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
spellingShingle	Ismb/Eccb 2019 Conference Proceedings Valdes, Camilo Stebliankin, Vitalii Narasimhan, Giri Large scale microbiome profiling in the cloud
title	Large scale microbiome profiling in the cloud
title_full	Large scale microbiome profiling in the cloud
title_fullStr	Large scale microbiome profiling in the cloud
title_full_unstemmed	Large scale microbiome profiling in the cloud
title_short	Large scale microbiome profiling in the cloud
title_sort	large scale microbiome profiling in the cloud
topic	Ismb/Eccb 2019 Conference Proceedings
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6612844/ https://www.ncbi.nlm.nih.gov/pubmed/31510682 http://dx.doi.org/10.1093/bioinformatics/btz356
work_keys_str_mv	AT valdescamilo largescalemicrobiomeprofilinginthecloud AT stebliankinvitalii largescalemicrobiomeprofilinginthecloud AT narasimhangiri largescalemicrobiomeprofilinginthecloud
