LCA*: an entropy-based measure for taxonomic assignment within assembled metagenomes

Motivation: A perennial problem in the analysis of environmental sequence information is the assignment of reads or assembled sequences, e.g. contigs or scaffolds, to discrete taxonomic bins. In the absence of reference genomes for most environmental microorganisms, the use of intrinsic nucleotide p...

Bibliographic Details
Main authors: Hanson, Niels W., Konwar, Kishori M., Hallam, Steven J.
Format: Online Article Text
Language: English
Published: Oxford University Press 2016
Subjects: Original Papers
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5181528/
https://www.ncbi.nlm.nih.gov/pubmed/27515739
http://dx.doi.org/10.1093/bioinformatics/btw400
_version_ 1782485727822479360
author Hanson, Niels W.
Konwar, Kishori M.
Hallam, Steven J.
author_facet Hanson, Niels W.
Konwar, Kishori M.
Hallam, Steven J.
author_sort Hanson, Niels W.
collection PubMed
description Motivation: A perennial problem in the analysis of environmental sequence information is the assignment of reads or assembled sequences, e.g. contigs or scaffolds, to discrete taxonomic bins. In the absence of reference genomes for most environmental microorganisms, the use of intrinsic nucleotide patterns and phylogenetic anchors can improve assembly-dependent binning needed for more accurate taxonomic and functional annotation in communities of microorganisms, and assist in identifying mobile genetic elements or lateral gene transfer events. Results: Here, we present a statistic called LCA* inspired by Information and Voting theories that uses the NCBI Taxonomic Database hierarchy to assign taxonomy to contigs assembled from environmental sequence information. The LCA* algorithm identifies a sufficiently strong majority on the hierarchy while minimizing entropy changes to the observed taxonomic distribution resulting in improved statistical properties. Moreover, we apply results from the order-statistic literature to formulate a likelihood-ratio hypothesis test and P-value for testing the supremacy of the assigned LCA* taxonomy. Using simulated and real-world datasets, we empirically demonstrate that voting-based methods, majority vote and LCA*, in the presence of known reference annotations, are consistently more accurate in identifying contig taxonomy than the lowest common ancestor algorithm popularized by MEGAN, and that LCA* taxonomy strikes a balance between specificity and confidence to provide an estimate appropriate to the available information in the data. Availability and Implementation: The LCA* has been implemented as a stand-alone Python library compatible with the MetaPathways pipeline; both of which are available on GitHub with installation instructions and use-cases (http://www.github.com/hallamlab/LCAStar/). Contact: shallam@mail.ubc.ca Supplementary information: Supplementary data are available at Bioinformatics online.
format Online
Article
Text
id pubmed-5181528
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-5181528 2016-12-27 LCA*: an entropy-based measure for taxonomic assignment within assembled metagenomes Hanson, Niels W. Konwar, Kishori M. Hallam, Steven J. Bioinformatics Original Papers Motivation: A perennial problem in the analysis of environmental sequence information is the assignment of reads or assembled sequences, e.g. contigs or scaffolds, to discrete taxonomic bins. In the absence of reference genomes for most environmental microorganisms, the use of intrinsic nucleotide patterns and phylogenetic anchors can improve assembly-dependent binning needed for more accurate taxonomic and functional annotation in communities of microorganisms, and assist in identifying mobile genetic elements or lateral gene transfer events. Results: Here, we present a statistic called LCA* inspired by Information and Voting theories that uses the NCBI Taxonomic Database hierarchy to assign taxonomy to contigs assembled from environmental sequence information. The LCA* algorithm identifies a sufficiently strong majority on the hierarchy while minimizing entropy changes to the observed taxonomic distribution resulting in improved statistical properties. Moreover, we apply results from the order-statistic literature to formulate a likelihood-ratio hypothesis test and P-value for testing the supremacy of the assigned LCA* taxonomy. Using simulated and real-world datasets, we empirically demonstrate that voting-based methods, majority vote and LCA*, in the presence of known reference annotations, are consistently more accurate in identifying contig taxonomy than the lowest common ancestor algorithm popularized by MEGAN, and that LCA* taxonomy strikes a balance between specificity and confidence to provide an estimate appropriate to the available information in the data.
Availability and Implementation: The LCA* has been implemented as a stand-alone Python library compatible with the MetaPathways pipeline; both of which are available on GitHub with installation instructions and use-cases (http://www.github.com/hallamlab/LCAStar/). Contact: shallam@mail.ubc.ca Supplementary information: Supplementary data are available at Bioinformatics online. Oxford University Press 2016-12-01 2016-08-11 /pmc/articles/PMC5181528/ /pubmed/27515739 http://dx.doi.org/10.1093/bioinformatics/btw400 Text en © The Author 2016. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
spellingShingle Original Papers
Hanson, Niels W.
Konwar, Kishori M.
Hallam, Steven J.
LCA*: an entropy-based measure for taxonomic assignment within assembled metagenomes
title LCA*: an entropy-based measure for taxonomic assignment within assembled metagenomes
title_full LCA*: an entropy-based measure for taxonomic assignment within assembled metagenomes
title_fullStr LCA*: an entropy-based measure for taxonomic assignment within assembled metagenomes
title_full_unstemmed LCA*: an entropy-based measure for taxonomic assignment within assembled metagenomes
title_short LCA*: an entropy-based measure for taxonomic assignment within assembled metagenomes
title_sort lca*: an entropy-based measure for taxonomic assignment within assembled metagenomes
topic Original Papers
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5181528/
https://www.ncbi.nlm.nih.gov/pubmed/27515739
http://dx.doi.org/10.1093/bioinformatics/btw400
work_keys_str_mv AT hansonnielsw lcaanentropybasedmeasurefortaxonomicassignmentwithinassembledmetagenomes
AT konwarkishorim lcaanentropybasedmeasurefortaxonomicassignmentwithinassembledmetagenomes
AT hallamstevenj lcaanentropybasedmeasurefortaxonomicassignmentwithinassembledmetagenomes