
Classifying short genomic fragments from novel lineages using composition and homology

BACKGROUND: The assignment of taxonomic attributions to DNA fragments recovered directly from the environment is a vital step in metagenomic data analysis. Assignments can be made using rank-specific classifiers, which assign reads to taxonomic labels from a predetermined level such as named species...


Bibliographic Details
Main authors: Parks, Donovan H; MacDonald, Norman J; Beiko, Robert G
Format: Online Article Text
Language: English
Published: BioMed Central 2011
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3173459/
https://www.ncbi.nlm.nih.gov/pubmed/21827705
http://dx.doi.org/10.1186/1471-2105-12-328
author Parks, Donovan H
MacDonald, Norman J
Beiko, Robert G
collection PubMed
description BACKGROUND: The assignment of taxonomic attributions to DNA fragments recovered directly from the environment is a vital step in metagenomic data analysis. Assignments can be made using rank-specific classifiers, which assign reads to taxonomic labels from a predetermined level such as named species or strain, or rank-flexible classifiers, which choose an appropriate taxonomic rank for each sequence in a data set. The choice of rank typically depends on the optimal model for a given sequence and on the breadth of taxonomic groups seen in a set of close-to-optimal models. Homology-based (e.g., LCA) and composition-based (e.g., PhyloPythia, TACOA) rank-flexible classifiers have been proposed, but there is at present no hybrid approach that utilizes both homology and composition.
RESULTS: We first develop a hybrid, rank-specific classifier based on BLAST and Naïve Bayes (NB) that has accuracy comparable to, and a faster running time than, the current best approach, PhymmBL. By substituting LCA for BLAST or allowing the inclusion of suboptimal NB models, we obtain a rank-flexible classifier. This hybrid classifier outperforms established rank-flexible approaches on simulated metagenomic fragments of length 200 bp to 1000 bp and is able to assign taxonomic attributions to a subset of sequences with few misclassifications. We then demonstrate the performance of different classifiers on an enhanced biological phosphorus removal metagenome, illustrating the advantages of rank-flexible classifiers when representative genomes are absent from the set of reference genomes. Application to a glacier ice metagenome demonstrates that similar taxonomic profiles are obtained across a set of classifiers which are increasingly conservative in their classification.
CONCLUSIONS: Our NB-based classification scheme is faster than the current best composition-based algorithm, Phymm, while providing equally accurate predictions. The rank-flexible variant of NB, which we term ε-NB, is complementary to LCA and can be combined with it to yield conservative prediction sets of very high confidence. The simple parameterization of LCA and ε-NB allows for tuning of the balance between more predictions and increased precision, allowing the user to account for the sensitivity of downstream analyses to misclassified or unclassified sequences.
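The rank-flexible idea described in the abstract (keep the suboptimal NB models that score close to the best one, then report the rank they share) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that the NB features are k-mer counts with add-one smoothing, that "close to the best" means a log-likelihood within a threshold eps of the top model, and that the reported label is the lowest common ancestor of the retained models' lineages. The abstract specifies none of these details. The genome strings, lineages, word length K, and all function names are hypothetical toy values.

```python
import math
from collections import Counter
from itertools import product

K = 2  # toy word length, chosen only to keep the example small


def kmers(seq, k=K):
    """Return all overlapping words of length k in seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def train_model(genome, k=K):
    """Log-probability of every ACGT k-mer, with add-one smoothing."""
    counts = Counter(kmers(genome, k))
    total = sum(counts.values()) + 4 ** k
    words = ("".join(w) for w in product("ACGT", repeat=k))
    return {w: math.log((counts[w] + 1) / total) for w in words}


def log_likelihood(fragment, model, k=K):
    """Naive Bayes score: sum of per-word log-probabilities (ACGT only)."""
    return sum(model[w] for w in kmers(fragment, k))


def lca(lineages):
    """Longest shared prefix of lineages listed from root to leaf."""
    shared = []
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    return tuple(shared)


def classify(fragment, models, lineages, eps):
    """Keep every model within eps of the best score; return their LCA."""
    scores = {name: log_likelihood(fragment, m) for name, m in models.items()}
    best = max(scores.values())
    near_best = [name for name, s in scores.items() if best - s <= eps]
    return lca([lineages[name] for name in near_best])


# Hypothetical reference set: two similar genomes and one distant genome.
GENOMES = {
    "g1": "ACGTACGTACGTACGTACGT",
    "g2": "ACGTACGTACGAACGTACGT",
    "g3": "GGGGCCCCGGGGCCCCGGGG",
}
LINEAGES = {
    "g1": ("Bacteria", "Proteobacteria", "Escherichia"),
    "g2": ("Bacteria", "Proteobacteria", "Salmonella"),
    "g3": ("Bacteria", "Firmicutes", "Bacillus"),
}
MODELS = {name: train_model(seq) for name, seq in GENOMES.items()}

if __name__ == "__main__":
    for eps in (0.0, 1.0, 20.0):
        print(eps, classify("ACGTACGT", MODELS, LINEAGES, eps))
```

In this sketch, eps = 0 keeps only the single best model and so behaves like a rank-specific classifier. Larger values of eps admit more suboptimal models and push the assignment toward higher, more conservative ranks, which mirrors the trade-off between more predictions and increased precision noted in the conclusions.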
format Online
Article
Text
id pubmed-3173459
institution National Center for Biotechnology Information
language English
publishDate 2011
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3173459 2011-09-15 Classifying short genomic fragments from novel lineages using composition and homology Parks, Donovan H MacDonald, Norman J Beiko, Robert G BMC Bioinformatics Research Article BioMed Central 2011-08-09 /pmc/articles/PMC3173459/ /pubmed/21827705 http://dx.doi.org/10.1186/1471-2105-12-328 Text en Copyright ©2011 Parks et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Classifying short genomic fragments from novel lineages using composition and homology
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3173459/
https://www.ncbi.nlm.nih.gov/pubmed/21827705
http://dx.doi.org/10.1186/1471-2105-12-328