
Flexible taxonomic assignment of ambiguous sequencing reads

BACKGROUND: To characterize the diversity of bacterial populations in metagenomic studies, sequencing reads need to be accurately assigned to taxonomic units in a given reference taxonomy. Reads that cannot be reliably assigned to a unique leaf in the taxonomy (ambiguous reads) are typically assigne...


Bibliographic Details
Main Authors: Clemente, José C, Jansson, Jesper, Valiente, Gabriel
Format: Text
Language: English
Published: BioMed Central 2011
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3024944/
https://www.ncbi.nlm.nih.gov/pubmed/21211059
http://dx.doi.org/10.1186/1471-2105-12-8
_version_ 1782196839006601216
author Clemente, José C
Jansson, Jesper
Valiente, Gabriel
author_facet Clemente, José C
Jansson, Jesper
Valiente, Gabriel
author_sort Clemente, José C
collection PubMed
description BACKGROUND: To characterize the diversity of bacterial populations in metagenomic studies, sequencing reads need to be accurately assigned to taxonomic units in a given reference taxonomy. Reads that cannot be reliably assigned to a unique leaf in the taxonomy (ambiguous reads) are typically assigned to the lowest common ancestor (LCA) of the set of species that match them. This introduces a potentially severe error in the estimation of the bacteria present in the sample due to false positives, since all species in the subtree rooted at the ancestor are implicitly assigned to the read even though many of them may not match it. RESULTS: We present a method that maps each read to a node in the taxonomy that minimizes a penalty score while balancing the relevance of precision and recall in the assignment through a parameter q. This mapping can be obtained in time linear in the number of matching sequences, because LCA queries to the reference taxonomy take constant time. When applied to six different metagenomic datasets, our algorithm produces different taxonomic distributions depending on whether coverage or precision is maximized. Including information on the quality of the reads reduces the number of unassigned reads but increases the number of ambiguous reads, stressing the relevance of our method. Finally, two measures of performance are described and results with a set of artificially generated datasets are discussed. CONCLUSIONS: The assignment strategy for sequencing reads introduced in this paper is a versatile and quick method to study bacterial communities. The bacterial composition of the analyzed samples can vary significantly depending on how ambiguous reads are assigned, that is, on the value of the q parameter. Validation of our results on an artificial dataset confirms that a combination of values of q produces the most accurate results.
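The idea of choosing a taxonomy node by minimizing a penalty that trades precision against recall can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not taken from this record: the penalty form q·FN/TP + (1 − q)·FP/TP, the convention that a larger q weights false negatives more heavily, the function names `leaves_below` and `assign_read`, and the toy taxonomy. The sketch also scans candidate nodes by brute force, so it does not reproduce the linear-time procedure with constant-time LCA queries mentioned in the abstract.

```python
from collections import defaultdict


def leaves_below(children, node):
    """Return the set of leaves in the subtree rooted at `node`."""
    stack, found = [node], set()
    while stack:
        current = stack.pop()
        kids = children.get(current, [])
        if kids:
            stack.extend(kids)
        else:
            found.add(current)
    return found


def assign_read(parent, matches, q):
    """Pick the node with the lowest assumed penalty for one read.

    parent  -- dict mapping each node to its parent (root maps to None)
    matches -- leaves (species) that the read matches
    q       -- weight in [0, 1]; here a larger q penalizes false negatives more
    """
    children = defaultdict(list)
    for child, par in parent.items():
        if par is not None:
            children[par].append(child)

    matches = set(matches)

    # Candidates: every matched leaf and each of its ancestors up to the root.
    candidates = set()
    for leaf in matches:
        node = leaf
        while node is not None and node not in candidates:
            candidates.add(node)
            node = parent[node]

    best_node, best_penalty = None, float("inf")
    for node in sorted(candidates):  # sorted only to make ties deterministic
        covered = leaves_below(children, node)
        tp = len(covered & matches)  # matched leaves under the node (always >= 1)
        fp = len(covered - matches)  # unmatched leaves under the node
        fn = len(matches - covered)  # matched leaves outside the node
        penalty = q * fn / tp + (1 - q) * fp / tp  # assumed penalty form
        if penalty < best_penalty:
            best_node, best_penalty = node, penalty
    return best_node, best_penalty


# Hypothetical toy taxonomy: two genera under one root.
example_parent = {
    "root": None,
    "genusA": "root", "genusB": "root",
    "a1": "genusA", "a2": "genusA", "a3": "genusA",
    "b1": "genusB", "b2": "genusB", "b3": "genusB", "b4": "genusB",
}
example_matches = {"a1", "a2", "a3", "b1"}

for q in (0.5, 0.9):
    print(q, assign_read(example_parent, example_matches, q))
```

In this toy case, q = 0.5 selects `genusA`, which covers three of the four matches with no false positives. With q = 0.9 the single missed match outweighs the three extra species, so the read moves up to `root`, which is the plain LCA assignment.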
format Text
id pubmed-3024944
institution National Center for Biotechnology Information
language English
publishDate 2011
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3024944 2011-01-24 Flexible taxonomic assignment of ambiguous sequencing reads Clemente, José C Jansson, Jesper Valiente, Gabriel BMC Bioinformatics Research Article BioMed Central 2011-01-07 /pmc/articles/PMC3024944/ /pubmed/21211059 http://dx.doi.org/10.1186/1471-2105-12-8 Text en Copyright ©2011 Clemente et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research Article
Clemente, José C
Jansson, Jesper
Valiente, Gabriel
Flexible taxonomic assignment of ambiguous sequencing reads
title Flexible taxonomic assignment of ambiguous sequencing reads
title_full Flexible taxonomic assignment of ambiguous sequencing reads
title_fullStr Flexible taxonomic assignment of ambiguous sequencing reads
title_full_unstemmed Flexible taxonomic assignment of ambiguous sequencing reads
title_short Flexible taxonomic assignment of ambiguous sequencing reads
title_sort flexible taxonomic assignment of ambiguous sequencing reads
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3024944/
https://www.ncbi.nlm.nih.gov/pubmed/21211059
http://dx.doi.org/10.1186/1471-2105-12-8
work_keys_str_mv AT clementejosec flexibletaxonomicassignmentofambiguoussequencingreads
AT janssonjesper flexibletaxonomicassignmentofambiguoussequencingreads
AT valientegabriel flexibletaxonomicassignmentofambiguoussequencingreads