
A scalable assembly-free variable selection algorithm for biomarker discovery from metagenomes

BACKGROUND: Metagenomics holds great promises for deepening our knowledge of key bacterial driven processes, but metagenome assembly remains problematic, typically resulting in representation biases and discarding significant amounts of non-redundant sequence information. In order to alleviate const...


Bibliographic Details
Main authors:	Gkanogiannis, Anestis; Gazut, Stéphane; Salanoubat, Marcel; Kanj, Sawsan; Brüls, Thomas
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2016
Subjects:	Methodology Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4992282/ https://www.ncbi.nlm.nih.gov/pubmed/27542753 http://dx.doi.org/10.1186/s12859-016-1186-3

author	Gkanogiannis, Anestis; Gazut, Stéphane; Salanoubat, Marcel; Kanj, Sawsan; Brüls, Thomas
collection	PubMed
description	BACKGROUND: Metagenomics holds great promises for deepening our knowledge of key bacterial driven processes, but metagenome assembly remains problematic, typically resulting in representation biases and discarding significant amounts of non-redundant sequence information. In order to alleviate constraints assembly can impose on downstream analyses, and/or to increase the fraction of raw reads assembled via targeted assemblies relying on pre-assembly binning steps, we developed a set of binning modules and evaluated their combination in a new “assembly-free” binning protocol. RESULTS: We describe a scalable multi-tiered binning algorithm that combines frequency and compositional features to cluster unassembled reads, and demonstrate i) significant runtime performance gains of the developed modules against state of the art software, obtained through parallelization and the efficient use of large lock-free concurrent hash maps, ii) its relevance for clustering unassembled reads from high complexity (e.g., harboring 700 distinct genomes) samples, iii) its relevance to experimental setups involving multiple samples, through a use case consisting in the “de novo” identification of sequences from a target genome (e.g., a pathogenic strain) segregating at low levels in a cohort of 50 complex microbiomes (harboring 100 distinct genomes each), in the background of closely related strains and the absence of reference genomes, iv) its ability to correctly identify clusters of sequences from the E. coli O104:H4 genome as the most strongly correlated to the infection status in 53 microbiomes sampled from the 2011 STEC outbreak in Germany, and to accurately cluster contigs of this pathogenic strain from a cross-assembly of these 53 microbiomes. CONCLUSIONS: We present a set of sequence clustering (“binning”) modules and their application to biomarker (e.g., genomes of pathogenic organisms) discovery from large synthetic and real metagenomics datasets. 
Initially designed for the “assembly-free” analysis of individual metagenomic samples, we demonstrate their extension to setups involving multiple samples via the usage of the “alignment-free” d(2)S statistic to relate clusters across samples, and illustrate how the clustering modules can otherwise be leveraged for de novo “pre-assembly” tasks by segregating sequences into biologically meaningful partitions. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-016-1186-3) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-4992282
institution	National Center for Biotechnology Information
language	English
publishDate	2016
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4992282 2016-08-31 A scalable assembly-free variable selection algorithm for biomarker discovery from metagenomes Gkanogiannis, Anestis; Gazut, Stéphane; Salanoubat, Marcel; Kanj, Sawsan; Brüls, Thomas BMC Bioinformatics Methodology Article BioMed Central 2016-08-19 /pmc/articles/PMC4992282/ /pubmed/27542753 http://dx.doi.org/10.1186/s12859-016-1186-3 Text en © The Author(s). 2016 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	A scalable assembly-free variable selection algorithm for biomarker discovery from metagenomes
topic	Methodology Article


Similar Items