Prevalence and Implications of Contamination in Public Genomic Resources: A Case Study of 43 Reference Arthropod Assemblies

Thanks to huge advances in sequencing technologies, genomic resources are increasingly being generated and shared by the scientific community. The quality of such public resources is therefore of critical importance. Errors due to contamination are particularly worrying; they are widespread, propag...

Bibliographic Details
Main Authors: Francois, Clementine M., Durand, Faustine, Figuet, Emeric, Galtier, Nicolas
Format: Online Article Text
Language: English
Published: Genetics Society of America 2019
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7003083/
https://www.ncbi.nlm.nih.gov/pubmed/31862787
http://dx.doi.org/10.1534/g3.119.400758
_version_ 1783494471729545216
author Francois, Clementine M.
Durand, Faustine
Figuet, Emeric
Galtier, Nicolas
collection PubMed
description Thanks to huge advances in sequencing technologies, genomic resources are increasingly being generated and shared by the scientific community. The quality of such public resources is therefore of critical importance. Errors due to contamination are particularly worrying; they are widespread, propagate across databases, and can compromise downstream analyses, especially the detection of horizontally-transferred sequences. However, we still lack consistent and comprehensive assessments of contamination prevalence in public genomic data. Here we applied a standardized procedure for foreign sequence annotation to 43 published arthropod genomes from the widely used Ensembl Metazoa database. This method combines information on sequence similarity and synteny to identify contaminant and putative horizontally-transferred sequences in any genome assembly, provided that an adequate reference database is available. We uncovered considerable heterogeneity in quality among arthropod assemblies, some being devoid of contaminant sequences, whereas others included hundreds of contaminant genes. Contaminants far outnumbered horizontally-transferred genes and were a major confounder of their detection, quantification and analysis. We strongly recommend that automated standardized decontamination procedures be systematically embedded into the submission process to genomic databases.
format Online
Article
Text
id pubmed-7003083
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Genetics Society of America
record_format MEDLINE/PubMed
spelling pubmed-7003083 2020-02-14 Prevalence and Implications of Contamination in Public Genomic Resources: A Case Study of 43 Reference Arthropod Assemblies Francois, Clementine M. Durand, Faustine Figuet, Emeric Galtier, Nicolas G3 (Bethesda) Investigations Thanks to huge advances in sequencing technologies, genomic resources are increasingly being generated and shared by the scientific community. The quality of such public resources is therefore of critical importance. Errors due to contamination are particularly worrying; they are widespread, propagate across databases, and can compromise downstream analyses, especially the detection of horizontally-transferred sequences. However, we still lack consistent and comprehensive assessments of contamination prevalence in public genomic data. Here we applied a standardized procedure for foreign sequence annotation to 43 published arthropod genomes from the widely used Ensembl Metazoa database. This method combines information on sequence similarity and synteny to identify contaminant and putative horizontally-transferred sequences in any genome assembly, provided that an adequate reference database is available. We uncovered considerable heterogeneity in quality among arthropod assemblies, some being devoid of contaminant sequences, whereas others included hundreds of contaminant genes. Contaminants far outnumbered horizontally-transferred genes and were a major confounder of their detection, quantification and analysis. We strongly recommend that automated standardized decontamination procedures be systematically embedded into the submission process to genomic databases. Genetics Society of America 2019-12-20 /pmc/articles/PMC7003083/ /pubmed/31862787 http://dx.doi.org/10.1534/g3.119.400758 Text en Copyright © 2020 Francois et al.
http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Prevalence and Implications of Contamination in Public Genomic Resources: A Case Study of 43 Reference Arthropod Assemblies
topic Investigations
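
The description in this record says the method combines sequence similarity and synteny to separate contaminants from putative horizontally-transferred sequences, without giving the decision rule. The sketch below is one possible reading of that idea, not the authors' procedure: a gene whose best similarity hit is non-arthropod is treated as a putative horizontal transfer when it shares a scaffold with at least one native-looking gene, and as a contaminant otherwise. The `Gene` class, the `best_hit_taxon` labels, and the function name are all hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Gene:
    gene_id: str
    scaffold: str
    best_hit_taxon: str  # hypothetical label from a similarity search, e.g. "arthropod"


def classify_foreign_genes(genes: List[Gene],
                           native_taxon: str = "arthropod") -> Dict[str, str]:
    """Label each gene as native, putative_hgt, or contaminant (illustrative rule)."""
    # Group genes by the scaffold they sit on; this stands in for synteny.
    by_scaffold: Dict[str, List[Gene]] = {}
    for gene in genes:
        by_scaffold.setdefault(gene.scaffold, []).append(gene)

    labels: Dict[str, str] = {}
    for scaffold_genes in by_scaffold.values():
        # Assumption: one native-looking gene is enough to call the scaffold native.
        has_native = any(g.best_hit_taxon == native_taxon for g in scaffold_genes)
        for g in scaffold_genes:
            if g.best_hit_taxon == native_taxon:
                labels[g.gene_id] = "native"
            elif has_native:
                labels[g.gene_id] = "putative_hgt"
            else:
                labels[g.gene_id] = "contaminant"
    return labels


if __name__ == "__main__":
    example = [
        Gene("g1", "scaffold_1", "arthropod"),
        Gene("g2", "scaffold_1", "bacteria"),
        Gene("g3", "scaffold_2", "bacteria"),
    ]
    print(classify_foreign_genes(example))
```

A real pipeline would need score thresholds, a curated reference database, and a stricter notion of synteny than scaffold co-occurrence; none of those details are stated in this record.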