Rapid phylogenetic and functional classification of short genomic fragments with signature peptides

BACKGROUND: Classification is difficult for shotgun metagenomics data from environments such as soils, where the diversity of sequences is high and where reference sequences from close relatives may not exist. Approaches based on sequence-similarity scores must deal with the confounding effects that...

Bibliographic Details
Main authors: Berendzen, Joel; Bruno, William J; Cohn, Judith D; Hengartner, Nicolas W; Kuske, Cheryl R; McMahon, Benjamin H; Wolinsky, Murray A; Xie, Gary
Format: Online Article Text
Language: English
Published: BioMed Central 2012
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3772700/
https://www.ncbi.nlm.nih.gov/pubmed/22925230
http://dx.doi.org/10.1186/1756-0500-5-460
author Berendzen, Joel
Bruno, William J
Cohn, Judith D
Hengartner, Nicolas W
Kuske, Cheryl R
McMahon, Benjamin H
Wolinsky, Murray A
Xie, Gary
collection PubMed
description BACKGROUND: Classification is difficult for shotgun metagenomics data from environments such as soils, where the diversity of sequences is high and where reference sequences from close relatives may not exist. Approaches based on sequence-similarity scores must deal with the confounding effects that inheritance and functional pressures exert on the relation between scores and phylogenetic distance, while approaches based on sequence alignment and tree-building are typically limited to a small fraction of gene families. We describe an approach based on finding one or more exact matches between a read and a precomputed set of peptide 10-mers. RESULTS: At even the largest phylogenetic distances, thousands of 10-mer peptide exact matches can be found between pairs of bacterial genomes. Genes that share one or more peptide 10-mers typically have high reciprocal BLAST scores. Among a set of 403 representative bacterial genomes, some 20 million 10-mer peptides were found to be shared. We assign each of these peptides as a signature of a particular node in a phylogenetic reference tree based on the RNA polymerase genes. We classify the phylogeny of a genomic fragment (e.g., read) at the most specific node on the reference tree that is consistent with the phylogeny of observed signature peptides it contains. Using both synthetic data from four newly sequenced soil-bacterium genomes and ten real soil metagenomics data sets, we demonstrate a sensitivity and specificity comparable to that of the MEGAN metagenomics analysis package using BLASTX against the NR database. Phylogenetic and functional similarity metrics applied to real metagenomics data indicate a signal-to-noise ratio of approximately 400 for distinguishing among environments. Our method assigns ~6.6 Gbp/hr on a single CPU, compared with 25 kbp/hr for methods based on BLASTX against the NR database.
CONCLUSIONS: Classification by exact matching against a precomputed list of signature peptides provides results comparable to those of existing techniques for reads longer than about 300 bp and does not degrade severely with shorter reads. Orders of magnitude faster than existing methods, the approach is suitable now for inclusion in analysis pipelines and appears to be extensible in several different directions.
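The classification step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The tree, the peptide strings, and the names `PARENT`, `SIGNATURES`, and `classify` are all hypothetical. The sketch assumes the read has already been translated into amino-acid sequences (one string per reading frame), and it assumes that "consistent" means: if all signature hits lie on a single lineage, report the deepest hit; otherwise fall back to the lowest common ancestor of the hits.

```python
from typing import Dict, Iterable, Iterator, List, Optional

K = 10  # peptide word length named in the abstract

# Hypothetical toy reference tree, stored as child -> parent (root maps to None).
PARENT: Dict[str, Optional[str]] = {
    "Bacteria": None,
    "Proteobacteria": "Bacteria",
    "Firmicutes": "Bacteria",
    "Gammaproteobacteria": "Proteobacteria",
}

# Hypothetical precomputed table: 10-mer peptide -> node it is a signature of.
SIGNATURES: Dict[str, str] = {
    "MKTAYIAKQR": "Bacteria",
    "GIDLGTTNSC": "Proteobacteria",
    "VAVLGAAGGI": "Gammaproteobacteria",
    "LSGGQQQRVA": "Firmicutes",
}


def lineage(node: str) -> List[str]:
    """Return the path from the root down to `node`, root first."""
    path: List[str] = []
    current: Optional[str] = node
    while current is not None:
        path.append(current)
        current = PARENT[current]
    return path[::-1]


def kmers(peptide: str, k: int = K) -> Iterator[str]:
    """Yield every overlapping k-mer of a peptide string."""
    return (peptide[i:i + k] for i in range(len(peptide) - k + 1))


def classify(frames: Iterable[str]) -> Optional[str]:
    """Classify a read from its translated frames by exact 10-mer lookup."""
    hits = {SIGNATURES[w] for f in frames for w in kmers(f) if w in SIGNATURES}
    if not hits:
        return None  # no signature peptide found: read stays unclassified
    lineages = [lineage(n) for n in hits]
    deepest = max(lineages, key=len)
    # All hits on one lineage: the deepest hit agrees with every signature.
    if all(lin == deepest[:len(lin)] for lin in lineages):
        return deepest[-1]
    # Conflicting branches (assumed rule): back off to the lowest common ancestor.
    common = None
    for level in zip(*lineages):
        if len(set(level)) != 1:
            break
        common = level[0]
    return common


if __name__ == "__main__":
    print(classify(["XXMKTAYIAKQRXXGIDLGTTNSCXX"]))
    print(classify(["VAVLGAAGGIXX", "XXLSGGQQQRVA"]))
```

Because each lookup is a dictionary hit on a fixed-length word rather than an alignment, the cost per read grows only with read length, which is in line with the speed advantage the abstract reports over BLASTX-based methods.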
format Online
Article
Text
id pubmed-3772700
institution National Center for Biotechnology Information
language English
publishDate 2012
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3772700 2013-09-14 BMC Res Notes Research Article BioMed Central 2012-08-28 /pmc/articles/PMC3772700/ /pubmed/22925230 http://dx.doi.org/10.1186/1756-0500-5-460 Text en Copyright © 2012 Berendzen et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Rapid phylogenetic and functional classification of short genomic fragments with signature peptides
topic Research Article