
Contamination in Reference Sequence Databases: Time for Divide-and-Rule Tactics

Contaminating sequences in public genome databases is a pervasive issue with potentially far-reaching consequences. This problem has attracted much attention in the recent literature and many different tools are now available to detect contaminants. Although these methods are based on diverse algori...

Bibliographic Details
Main authors:	Lupo, Valérian; Van Vlierberghe, Mick; Vanderschuren, Hervé; Kerff, Frédéric; Baurain, Denis; Cornet, Luc
Format:	Online Article Text
Language:	English
Published:	Frontiers Media S.A. 2021
Subjects:	Microbiology
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8570097/ https://www.ncbi.nlm.nih.gov/pubmed/34745061 http://dx.doi.org/10.3389/fmicb.2021.755101

author	Lupo, Valérian; Van Vlierberghe, Mick; Vanderschuren, Hervé; Kerff, Frédéric; Baurain, Denis; Cornet, Luc
collection	PubMed
description	The presence of contaminating sequences in public genome databases is a pervasive issue with potentially far-reaching consequences. This problem has attracted much attention in the recent literature, and many different tools are now available to detect contaminants. Although these methods are based on diverse algorithms that can sometimes produce widely different estimates of the contamination level, the majority of genomic studies rely on a single method of detection, which carries a risk of systematic error. In this work, we used two orthogonal methods to assess the level of contamination among National Center for Biotechnology Information Reference Sequence Database (RefSeq) bacterial genomes. First, we applied the most popular solution, CheckM, which is based on gene markers. We then complemented this approach with a genome-wide method, termed Physeter, which now implements a k-folds algorithm to avoid inaccurate detection due to potential contamination of the reference database. We demonstrate that CheckM cannot currently be applied to all available genomes and bacterial groups. While it performed well on the majority of RefSeq genomes, it produced dubious results for 12,326 organisms. Among those, Physeter identified 239 contaminated genomes that had been missed by CheckM. In conclusion, we emphasize the importance of using multiple methods of detection while providing an upgrade of our own detection tool, Physeter, which minimizes incorrect contamination estimates in the context of unavoidably contaminated reference databases.
format	Online Article Text
id	pubmed-8570097
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	Frontiers Media S.A.
record_format	MEDLINE/PubMed
spelling	pubmed-8570097 2021-11-06 Front Microbiol Microbiology Frontiers Media S.A. 2021-10-22 /pmc/articles/PMC8570097/ /pubmed/34745061 http://dx.doi.org/10.3389/fmicb.2021.755101 Text en Copyright © 2021 Lupo, Van Vlierberghe, Vanderschuren, Kerff, Baurain and Cornet.
https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title	Contamination in Reference Sequence Databases: Time for Divide-and-Rule Tactics
topic	Microbiology
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8570097/ https://www.ncbi.nlm.nih.gov/pubmed/34745061 http://dx.doi.org/10.3389/fmicb.2021.755101
