
Artificial and natural duplicates in pyrosequencing reads of metagenomic data

BACKGROUND: Artificial duplicates from pyrosequencing reads may lead to incorrect interpretation of the abundance of species and genes in metagenomic studies. Duplicated reads were filtered out in many metagenomic projects. However, since the duplicated reads observed in a pyrosequencing run also in...


Bibliographic Details
Main authors:	Niu, Beifang; Fu, Limin; Sun, Shulei; Li, Weizhong
Format:	Text
Language:	English
Published:	BioMed Central 2010
Subjects:	Research article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2874554/ https://www.ncbi.nlm.nih.gov/pubmed/20388221 http://dx.doi.org/10.1186/1471-2105-11-187

_version_	1782181494926606336
author	Niu, Beifang; Fu, Limin; Sun, Shulei; Li, Weizhong
author_sort	Niu, Beifang
collection	PubMed
description	BACKGROUND: Artificial duplicates from pyrosequencing reads may lead to incorrect interpretation of the abundance of species and genes in metagenomic studies. Duplicated reads were filtered out in many metagenomic projects. However, since the duplicated reads observed in a pyrosequencing run also include natural (non-artificial) duplicates, simply removing all duplicates may also cause underestimation of abundance associated with natural duplicates. RESULTS: We implemented a method for identification of exact and nearly identical duplicates from pyrosequencing reads. This method performs an all-against-all sequence comparison and clusters the duplicates into groups using an algorithm modified from our previous sequence clustering method cd-hit. This method can process a typical dataset in ~10 minutes; it also provides a consensus sequence for each group of duplicates. We applied this method to the underlying raw reads of 39 genomic projects and 10 metagenomic projects that used the pyrosequencing technique. We compared the occurrences of the duplicates identified by our method with the natural duplicates generated by independent simulations. We observed that the duplicates, including both artificial and natural duplicates, make up 4-44% of reads. The number of natural duplicates correlates strongly with the sample's read density (number of reads divided by genome size). For high-complexity metagenomic samples lacking dominant species, natural duplicates make up only <1% of all duplicates. For some other samples, such as transcriptomic samples, the majority of the observed duplicates might be natural duplicates. CONCLUSIONS: Our method is available from http://cd-hit.org as a downloadable program and a web server. It is important not only to identify the duplicates from metagenomic datasets but also to distinguish whether they are artificial or natural duplicates. We provide a tool to estimate the number of natural duplicates according to user-defined sample types, so users can decide whether to retain or remove duplicates in their projects.
format	Text
id	pubmed-2874554
institution	National Center for Biotechnology Information
language	English
publishDate	2010
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-2874554 2010-05-22 Artificial and natural duplicates in pyrosequencing reads of metagenomic data Niu, Beifang Fu, Limin Sun, Shulei Li, Weizhong BMC Bioinformatics Research article BioMed Central 2010-04-13 /pmc/articles/PMC2874554/ /pubmed/20388221 http://dx.doi.org/10.1186/1471-2105-11-187 Text en Copyright ©2010 Niu et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Research article Niu, Beifang Fu, Limin Sun, Shulei Li, Weizhong Artificial and natural duplicates in pyrosequencing reads of metagenomic data
title	Artificial and natural duplicates in pyrosequencing reads of metagenomic data
topic	Research article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2874554/ https://www.ncbi.nlm.nih.gov/pubmed/20388221 http://dx.doi.org/10.1186/1471-2105-11-187
work_keys_str_mv	AT niubeifang artificialandnaturalduplicatesinpyrosequencingreadsofmetagenomicdata AT fulimin artificialandnaturalduplicatesinpyrosequencingreadsofmetagenomicdata AT sunshulei artificialandnaturalduplicatesinpyrosequencingreadsofmetagenomicdata AT liweizhong artificialandnaturalduplicatesinpyrosequencingreadsofmetagenomicdata
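
The abstract relates the number of natural duplicates to read density (number of reads divided by genome size). The following is a minimal sketch of that relationship, not the authors' simulation: it assumes reads start at uniformly random positions and strands on a single genome and treats any two reads sharing a start position and strand as natural duplicates. The function names `read_density` and `simulate_natural_duplicates` are hypothetical and chosen only for this illustration.

```python
import random
from collections import Counter


def read_density(num_reads: int, genome_size: int) -> float:
    """Read density as defined in the abstract: reads divided by genome size."""
    return num_reads / genome_size


def simulate_natural_duplicates(num_reads: int, genome_size: int, seed: int = 0) -> float:
    """Return the fraction of reads that duplicate an earlier read by chance.

    Assumption (not from the source): each read starts at a uniformly random
    position on either strand, and reads with the same (position, strand)
    are counted as natural duplicates.
    """
    if num_reads == 0:
        return 0.0
    rng = random.Random(seed)
    starts = Counter(
        (rng.randrange(genome_size), rng.choice("+-"))
        for _ in range(num_reads)
    )
    # Every read beyond the first at a given start counts as one duplicate.
    duplicates = sum(count - 1 for count in starts.values())
    return duplicates / num_reads


if __name__ == "__main__":
    genome_size = 1_000_000  # hypothetical genome size in bases
    for num_reads in (1_000, 10_000, 100_000):
        density = read_density(num_reads, genome_size)
        fraction = simulate_natural_duplicates(num_reads, genome_size)
        print(f"density={density:.3f} natural duplicate fraction={fraction:.4f}")
```

Under these assumptions the duplicate fraction grows with read density, which is consistent with the correlation the abstract reports. A realistic estimate would also need to account for read length, uneven coverage, and the species abundance distribution of the sample.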
