Unsupervised statistical clustering of environmental shotgun sequences

BACKGROUND: The development of effective environmental shotgun sequence binning methods remains an ongoing challenge in algorithmic analysis of metagenomic data. While previous methods have focused primarily on supervised learning involving extrinsic data, a first-principles statistical model combin...

Bibliographic Details
Main Authors: Kislyuk, Andrey; Bhatnagar, Srijak; Dushoff, Jonathan; Weitz, Joshua S
Format: Text
Language: English
Published: BioMed Central 2009
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2765972/
https://www.ncbi.nlm.nih.gov/pubmed/19799776
http://dx.doi.org/10.1186/1471-2105-10-316
_version_ 1782173184768868352
author Kislyuk, Andrey
Bhatnagar, Srijak
Dushoff, Jonathan
Weitz, Joshua S
author_facet Kislyuk, Andrey
Bhatnagar, Srijak
Dushoff, Jonathan
Weitz, Joshua S
author_sort Kislyuk, Andrey
collection PubMed
description BACKGROUND: The development of effective environmental shotgun sequence binning methods remains an ongoing challenge in algorithmic analysis of metagenomic data. While previous methods have focused primarily on supervised learning involving extrinsic data, a first-principles statistical model combined with a self-training fitting method has not yet been developed. RESULTS: We derive an unsupervised, maximum-likelihood formalism for clustering short sequences by their taxonomic origin on the basis of their k-mer distributions. The formalism is implemented using a Markov Chain Monte Carlo approach in a k-mer feature space. We introduce a space transformation that reduces the dimensionality of the feature space and a genomic fragment divergence measure that strongly correlates with the method's performance. Pairwise analysis of over 1000 completely sequenced genomes reveals that the vast majority of genomes have sufficient genomic fragment divergence to be amenable for binning using the present formalism. Using a high-performance implementation, the binner is able to classify fragments as short as 400 nt with accuracy over 90% in simulations of low-complexity communities of 2 to 10 species, given sufficient genomic fragment divergence. The method is available as an open source package called LikelyBin. CONCLUSION: An unsupervised binning method based on statistical signatures of short environmental sequences is a viable stand-alone binning method for low complexity samples. For medium and high complexity samples, we discuss the possibility of combining the current method with other methods as part of an iterative process to enhance the resolving power of sorting reads into taxonomic and/or functional bins.
format Text
id pubmed-2765972
institution National Center for Biotechnology Information
language English
publishDate 2009
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2765972 2009-10-23 Unsupervised statistical clustering of environmental shotgun sequences Kislyuk, Andrey Bhatnagar, Srijak Dushoff, Jonathan Weitz, Joshua S BMC Bioinformatics Methodology Article
BioMed Central 2009-10-02 /pmc/articles/PMC2765972/ /pubmed/19799776 http://dx.doi.org/10.1186/1471-2105-10-316 Text en Copyright © 2009 Kislyuk et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Methodology Article
Kislyuk, Andrey
Bhatnagar, Srijak
Dushoff, Jonathan
Weitz, Joshua S
Unsupervised statistical clustering of environmental shotgun sequences
title Unsupervised statistical clustering of environmental shotgun sequences
title_full Unsupervised statistical clustering of environmental shotgun sequences
title_fullStr Unsupervised statistical clustering of environmental shotgun sequences
title_full_unstemmed Unsupervised statistical clustering of environmental shotgun sequences
title_short Unsupervised statistical clustering of environmental shotgun sequences
title_sort unsupervised statistical clustering of environmental shotgun sequences
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2765972/
https://www.ncbi.nlm.nih.gov/pubmed/19799776
http://dx.doi.org/10.1186/1471-2105-10-316
work_keys_str_mv AT kislyukandrey unsupervisedstatisticalclusteringofenvironmentalshotgunsequences
AT bhatnagarsrijak unsupervisedstatisticalclusteringofenvironmentalshotgunsequences
AT dushoffjonathan unsupervisedstatisticalclusteringofenvironmentalshotgunsequences
AT weitzjoshuas unsupervisedstatisticalclusteringofenvironmentalshotgunsequences
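
Illustrative sketch (not part of the catalog record): the abstract describes clustering short fragments by taxonomic origin from their k-mer distributions under a maximum-likelihood formalism. The Python below is a minimal illustration of that general idea only. It is not the LikelyBin implementation: it omits the Markov Chain Monte Carlo fitting and the dimensionality-reducing transformation, and it assumes a simplified model in which overlapping k-mers are treated as independent draws. All function names (kmer_counts, fit_model, log_likelihood, assign) and the pseudocount parameter are hypothetical.

```python
from collections import Counter
from itertools import product
from math import log

ALPHABET = "ACGT"


def kmer_counts(seq, k):
    """Count overlapping k-mers in seq, skipping any that contain non-ACGT characters."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set(ALPHABET):
            counts[kmer] += 1
    return counts


def fit_model(fragments, k, pseudocount=1.0):
    """Estimate k-mer probabilities from a set of fragments (additive smoothing).

    Assumption: a plain multinomial over k-mers, chosen here for brevity.
    """
    total = Counter()
    for frag in fragments:
        total.update(kmer_counts(frag, k))
    kmers = ["".join(p) for p in product(ALPHABET, repeat=k)]
    denom = sum(total.values()) + pseudocount * len(kmers)
    return {km: (total[km] + pseudocount) / denom for km in kmers}


def log_likelihood(seq, model, k):
    """Log-likelihood of a fragment's k-mers under a k-mer probability model."""
    return sum(n * log(model[km]) for km, n in kmer_counts(seq, k).items())


def assign(fragments, models, k):
    """Assign each fragment to the index of the model with the highest log-likelihood."""
    return [
        max(range(len(models)), key=lambda j: log_likelihood(frag, models[j], k))
        for frag in fragments
    ]


if __name__ == "__main__":
    # Toy data: one AT-rich and one GC-rich "genome", used only for illustration.
    at_model = fit_model(["ATATATTTAAATAT" * 5], k=2)
    gc_model = fit_model(["GCGCGGCCGCGC" * 5], k=2)
    print(assign(["ATTATAATAT", "GGCGCCGCGG"], [at_model, gc_model], k=2))
```

In a self-training setting, one could alternate between assign and fit_model starting from a random partition; the paper instead explores the model parameters with Markov Chain Monte Carlo, which this sketch does not attempt.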