Fast Identification and Removal of Sequence Contamination from Genomic and Metagenomic Datasets

Bibliographic Details
Main authors: Schmieder, Robert; Edwards, Robert
Format: Text
Language: English
Published: Public Library of Science 2011
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3052304/
https://www.ncbi.nlm.nih.gov/pubmed/21408061
http://dx.doi.org/10.1371/journal.pone.0017288
collection PubMed
description High-throughput sequencing technologies have strongly impacted microbiology, providing a rapid and cost-effective way of generating draft genomes and exploring microbial diversity. However, sequences obtained from impure nucleic acid preparations may contain DNA from sources other than the sample. Those sequence contaminations are a serious concern to the quality of the data used for downstream analysis, causing misassembly of sequence contigs and erroneous conclusions. Therefore, the removal of sequence contaminants is a necessary and required step for all sequencing projects. We developed DeconSeq, a robust framework for the rapid, automated identification and removal of sequence contamination in longer-read datasets ([Image: see text]150 bp mean read length). DeconSeq is publicly available as standalone and web-based versions. The results can be exported for subsequent analysis, and the databases used for the web-based version are automatically updated on a regular basis. DeconSeq categorizes possible contamination sequences, eliminates redundant hits with higher similarity to non-contaminant genomes, and provides graphical visualizations of the alignment results and classifications. Using DeconSeq, we conducted an analysis of possible human DNA contamination in 202 previously published microbial and viral metagenomes and found possible contamination in 145 (72%) metagenomes with as high as 64% contaminating sequences. This new framework allows scientists to automatically detect and efficiently remove unwanted sequence contamination from their datasets while eliminating critical limitations of current methods. DeconSeq's web interface is simple and user-friendly. The standalone version allows offline analysis and integration into existing data processing pipelines. 
DeconSeq's results reveal whether the sequencing experiment has succeeded, whether the correct sample was sequenced, and whether the sample contains any sequence contamination from DNA preparation or host. In addition, the analysis of 202 metagenomes demonstrated significant contamination of the non-human associated metagenomes, suggesting that this method is appropriate for screening all metagenomes. DeconSeq is available at http://deconseq.sourceforge.net/.
format Text
id pubmed-3052304
institution National Center for Biotechnology Information
language English
publishDate 2011
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-3052304 2011-03-15 PLoS One Research Article Public Library of Science 2011-03-09 Text en Schmieder, Edwards. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
topic Research Article