Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank
Genomic analyses are sensitive to contamination in public databases caused by incorrectly labeled reference sequences. Here, we describe Conterminator, an efficient method to detect and remove incorrectly labeled sequences by an exhaustive all-against-all sequence comparison. Our analysis reports co...
Main Authors: | Steinegger, Martin, Salzberg, Steven L. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2020 |
Subjects: | Method |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7218494/ https://www.ncbi.nlm.nih.gov/pubmed/32398145 http://dx.doi.org/10.1186/s13059-020-02023-1 |
_version_ | 1783532809138208768 |
---|---|
author | Steinegger, Martin Salzberg, Steven L. |
author_facet | Steinegger, Martin Salzberg, Steven L. |
author_sort | Steinegger, Martin |
collection | PubMed |
description | Genomic analyses are sensitive to contamination in public databases caused by incorrectly labeled reference sequences. Here, we describe Conterminator, an efficient method to detect and remove incorrectly labeled sequences by an exhaustive all-against-all sequence comparison. Our analysis reports contamination of 2,161,746, 114,035, and 14,148 sequences in the RefSeq, GenBank, and NR databases, respectively, spanning the whole range from draft to “complete” model organism genomes. Our method scales linearly with input size and can process 3.3 TB in 12 days on a 32-core computer. Conterminator can help ensure the quality of reference databases. Source code (GPLv3): https://github.com/martin-steinegger/conterminator |
format | Online Article Text |
id | pubmed-7218494 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-72184942020-05-18 Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank Steinegger, Martin Salzberg, Steven L. Genome Biol Method Genomic analyses are sensitive to contamination in public databases caused by incorrectly labeled reference sequences. Here, we describe Conterminator, an efficient method to detect and remove incorrectly labeled sequences by an exhaustive all-against-all sequence comparison. Our analysis reports contamination of 2,161,746, 114,035, and 14,148 sequences in the RefSeq, GenBank, and NR databases, respectively, spanning the whole range from draft to “complete” model organism genomes. Our method scales linearly with input size and can process 3.3 TB in 12 days on a 32-core computer. Conterminator can help ensure the quality of reference databases. Source code (GPLv3): https://github.com/martin-steinegger/conterminator BioMed Central 2020-05-12 /pmc/articles/PMC7218494/ /pubmed/32398145 http://dx.doi.org/10.1186/s13059-020-02023-1 Text en © The Author(s) 2020 Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. 
The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
spellingShingle | Method Steinegger, Martin Salzberg, Steven L. Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank |
title | Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank |
title_full | Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank |
title_fullStr | Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank |
title_full_unstemmed | Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank |
title_short | Terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in GenBank |
title_sort | terminating contamination: large-scale search identifies more than 2,000,000 contaminated entries in genbank |
topic | Method |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7218494/ https://www.ncbi.nlm.nih.gov/pubmed/32398145 http://dx.doi.org/10.1186/s13059-020-02023-1 |
work_keys_str_mv | AT steineggermartin terminatingcontaminationlargescalesearchidentifiesmorethan2000000contaminatedentriesingenbank AT salzbergstevenl terminatingcontaminationlargescalesearchidentifiesmorethan2000000contaminatedentriesingenbank |
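
The description field states that the method scales linearly with input size and processes 3.3 TB in 12 days on a 32-core computer. As a rough worked illustration of what those figures imply, the sketch below derives the average throughput and extrapolates a runtime for a different input size. It is only arithmetic on the quoted numbers, not part of Conterminator: the function names are hypothetical, 1 TB is assumed to mean 10^12 bytes, and the extrapolation assumes the same hardware and the linear scaling claimed in the abstract.

```python
# Back-of-the-envelope arithmetic on the figures quoted in the abstract.
# Assumptions (not from the record): 1 TB = 1e12 bytes; runtime scales
# strictly linearly with input size on identical 32-core hardware.

SECONDS_PER_DAY = 86_400


def throughput_bytes_per_second(total_bytes: float, days: float) -> float:
    """Average bytes processed per second over the whole run."""
    return total_bytes / (days * SECONDS_PER_DAY)


def estimated_days(input_bytes: float,
                   ref_bytes: float = 3.3e12,
                   ref_days: float = 12.0) -> float:
    """Linear extrapolation of runtime from the reference run."""
    return ref_days * input_bytes / ref_bytes


rate = throughput_bytes_per_second(3.3e12, 12)  # whole machine
per_core = rate / 32                            # averaged over 32 cores

print(f"aggregate throughput: {rate / 1e6:.2f} MB/s")
print(f"per-core throughput:  {per_core / 1e3:.1f} kB/s")
print(f"estimated days for 1 TB: {estimated_days(1e12):.2f}")
```

Under these assumptions the run averages a little over 3 MB/s across the machine, and doubling the input to 6.6 TB would be expected to take about 24 days.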