Clustering metagenomic sequences with interpolated Markov models

BACKGROUND: Sequencing of environmental DNA (often called metagenomics) has shown tremendous potential to uncover the vast number of unknown microbes that cannot be cultured and sequenced by traditional methods. Because the output from metagenomic sequencing is a large set of reads of unknown origin...

Bibliographic Details
Main Authors: Kelley, David R; Salzberg, Steven L
Format: Text
Language: English
Published: BioMed Central 2010
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3098094/
https://www.ncbi.nlm.nih.gov/pubmed/21044341
http://dx.doi.org/10.1186/1471-2105-11-544
author Kelley, David R
Salzberg, Steven L
collection PubMed
description BACKGROUND: Sequencing of environmental DNA (often called metagenomics) has shown tremendous potential to uncover the vast number of unknown microbes that cannot be cultured and sequenced by traditional methods. Because the output from metagenomic sequencing is a large set of reads of unknown origin, clustering together reads that were sequenced from the same species is a crucial analysis step. Many effective approaches to this task rely on sequenced genomes in public databases, but these genomes are a highly biased sample that is not necessarily representative of the environments of interest to many metagenomics projects. RESULTS: We present SCIMM (Sequence Clustering with Interpolated Markov Models), an unsupervised sequence clustering method. SCIMM achieves greater clustering accuracy than previous unsupervised approaches. We examine the limitations of unsupervised learning on complex datasets, and suggest a hybrid of SCIMM and the supervised learning method Phymm, called PHYSCIMM, that performs better when evolutionarily close training genomes are available. CONCLUSIONS: SCIMM and PHYSCIMM are highly accurate methods for clustering metagenomic sequences. SCIMM operates entirely unsupervised, making it ideal for environments containing mostly novel microbes. PHYSCIMM uses supervised learning to improve clustering in environments containing microbial strains from well-characterized genera. SCIMM and PHYSCIMM are available as open source from http://www.cbcb.umd.edu/software/scimm.
format Text
id pubmed-3098094
institution National Center for Biotechnology Information
language English
publishDate 2010
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3098094 2011-07-08 Clustering metagenomic sequences with interpolated Markov models Kelley, David R Salzberg, Steven L BMC Bioinformatics Methodology Article BioMed Central 2010-11-02 /pmc/articles/PMC3098094/ /pubmed/21044341 http://dx.doi.org/10.1186/1471-2105-11-544 Text en Copyright ©2010 Kelley and Salzberg; licensee BioMed Central Ltd.
http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Clustering metagenomic sequences with interpolated Markov models
topic Methodology Article