MetaCluster-TA: taxonomic annotation for metagenomic data based on assembly-assisted binning

BACKGROUND: Taxonomic annotation of reads is an important problem in metagenomic analysis. Existing annotation tools, which rely on the approach of aligning each read to the taxonomic structure, are unable to annotate many reads efficiently and accurately as reads (~100 bp) are short and most of the...

Bibliographic Details
Main authors: Wang, Yi, Leung, Henry Chi Ming, Yiu, Siu Ming, Chin, Francis Yuk Lun
Format: Online Article Text
Language: English
Published: BioMed Central 2014
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4046714/
https://www.ncbi.nlm.nih.gov/pubmed/24564377
http://dx.doi.org/10.1186/1471-2164-15-S1-S12
author Wang, Yi
Leung, Henry Chi Ming
Yiu, Siu Ming
Chin, Francis Yuk Lun
collection PubMed
description BACKGROUND: Taxonomic annotation of reads is an important problem in metagenomic analysis. Existing annotation tools, which rely on aligning each read to the taxonomic structure, are unable to annotate many reads efficiently and accurately because reads (~100 bp) are short and most of them come from unknown genomes. Previous work has suggested assembling the reads into longer contigs before annotation. More reads/contigs can then be annotated, because a longer contig (in Kbp) can be aligned to a taxon even if it is from an unknown species, as long as it contains a conserved region of that taxon. Unfortunately, existing metagenomic assembly tools are not mature enough to produce sufficiently long contigs. Binning tries to group reads/contigs of similar species together. Intuitively, reads in the same group (cluster) should be annotated to the same taxon, and together these reads should cover a significant portion of the genome, alleviating the problem of short contigs if the quality of binning is high. However, no existing work has tried to use binning results to help solve the annotation problem. This work explores this direction. RESULTS: In this paper, we describe MetaCluster-TA, an assembly-assisted, binning-based annotation tool which relies on the innovative idea of annotating binned reads instead of aligning each read or contig to the taxonomic structure separately. We propose the novel concept of the 'virtual contig' (which can be up to 10 Kb in length) to represent a set of reads, and then represent each cluster as a set of 'virtual contigs' (which together can total up to 1 Mb in length) for annotation.
MetaCluster-TA can outperform the widely used MEGAN4 and can annotate (1) more reads, since the virtual contigs are much longer; (2) more accurately, since each cluster of long virtual contigs contains global information about the sampled genome, which tends to be more accurate than short reads or assembled contigs, which contain only local information about the genome; and (3) more efficiently, since there are far fewer long virtual contigs to align than short reads. MetaCluster-TA outperforms MetaCluster 5.0 as a binning tool, since binning itself can be more sensitive and precise given long virtual contigs, and the binning results can be improved using the reference taxonomic database. CONCLUSIONS: MetaCluster-TA can outperform the widely used MEGAN4 and can annotate more reads with higher accuracy and higher efficiency. It also outperforms MetaCluster 5.0 as a binning tool.
format Online
Article
Text
id pubmed-4046714
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4046714 2014-06-06 MetaCluster-TA: taxonomic annotation for metagenomic data based on assembly-assisted binning Wang, Yi Leung, Henry Chi Ming Yiu, Siu Ming Chin, Francis Yuk Lun BMC Genomics Proceedings BioMed Central 2014-01-24 /pmc/articles/PMC4046714/ /pubmed/24564377 http://dx.doi.org/10.1186/1471-2164-15-S1-S12 Text en © Wang et al.; licensee BioMed Central Ltd. 2014 This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title MetaCluster-TA: taxonomic annotation for metagenomic data based on assembly-assisted binning
topic Proceedings