
Human contamination in bacterial genomes has created thousands of spurious proteins

Contaminant sequences that appear in published genomes can cause numerous problems for downstream analyses, particularly for evolutionary studies and metagenomics projects. Our large-scale scan of complete and draft bacterial and archaeal genomes in the NCBI RefSeq database reveals that 2250 genomes...


Bibliographic Details
Main Authors: Breitwieser, Florian P., Pertea, Mihaela, Zimin, Aleksey V., Salzberg, Steven L.
Format: Online Article Text
Language: English
Published: Cold Spring Harbor Laboratory Press 2019
Subjects: Research
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6581058/
https://www.ncbi.nlm.nih.gov/pubmed/31064768
http://dx.doi.org/10.1101/gr.245373.118
_version_ 1783428125437198336
author Breitwieser, Florian P.
Pertea, Mihaela
Zimin, Aleksey V.
Salzberg, Steven L.
author_facet Breitwieser, Florian P.
Pertea, Mihaela
Zimin, Aleksey V.
Salzberg, Steven L.
author_sort Breitwieser, Florian P.
collection PubMed
description Contaminant sequences that appear in published genomes can cause numerous problems for downstream analyses, particularly for evolutionary studies and metagenomics projects. Our large-scale scan of complete and draft bacterial and archaeal genomes in the NCBI RefSeq database reveals that 2250 genomes are contaminated by human sequence. The contaminant sequences derive primarily from high-copy human repeat regions, which themselves are not adequately represented in the current human reference genome, GRCh38. The absence of the sequences from the human assembly offers a likely explanation for their presence in bacterial assemblies. In some cases, the contaminating contigs have been erroneously annotated as containing protein-coding sequences, which over time have propagated to create spurious protein “families” across multiple prokaryotic and eukaryotic genomes. As a result, 3437 spurious protein entries are currently present in the widely used nr and TrEMBL protein databases. We report here an extensive list of contaminant sequences in bacterial genome assemblies and the proteins associated with them. We found that nearly all contaminants occurred in small contigs in draft genomes, which suggests that filtering out small contigs from draft genome assemblies may mitigate the issue of contamination while still keeping nearly all of the genuine genomic sequences.
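The mitigation suggested at the end of the abstract, dropping small contigs from a draft assembly, can be sketched roughly as follows. This is a minimal illustration and not the authors' pipeline: the function names `parse_fasta` and `filter_small_contigs` and the `min_length` cutoff are assumptions made for this example, and the abstract does not state a specific length threshold.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (header without ">", sequence)


def parse_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_small_contigs(
    records: Iterable[Record], min_length: int
) -> Tuple[List[Record], List[Record]]:
    """Split contigs into (kept, dropped) by sequence length.

    min_length is a hypothetical cutoff chosen by the caller; the
    source abstract does not specify a value.
    """
    kept: List[Record] = []
    dropped: List[Record] = []
    for header, seq in records:
        if len(seq) >= min_length:
            kept.append((header, seq))
        else:
            dropped.append((header, seq))
    return kept, dropped


# Toy draft assembly with one short contig (illustrative data only).
demo_fasta = [
    ">contig_1 large",
    "ACGTACGTACGT",
    "ACGTACGT",
    ">contig_2 small",
    "ACGT",
    ">contig_3 large",
    "GGGGCCCCAAAATTTT",
]

kept, dropped = filter_small_contigs(parse_fasta(demo_fasta), min_length=10)
for header, seq in kept:
    print(f"kept    {header} ({len(seq)} bp)")
for header, seq in dropped:
    print(f"dropped {header} ({len(seq)} bp)")
```

Returning the dropped contigs alongside the kept ones makes it possible to review what was removed, for example by screening those short contigs against a human reference before discarding them.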
format Online
Article
Text
id pubmed-6581058
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Cold Spring Harbor Laboratory Press
record_format MEDLINE/PubMed
spelling pubmed-6581058 2019-12-01 Human contamination in bacterial genomes has created thousands of spurious proteins Breitwieser, Florian P. Pertea, Mihaela Zimin, Aleksey V. Salzberg, Steven L. Genome Res Research
Cold Spring Harbor Laboratory Press 2019-06 /pmc/articles/PMC6581058/ /pubmed/31064768 http://dx.doi.org/10.1101/gr.245373.118 Text en © 2019 Breitwieser et al.; Published by Cold Spring Harbor Laboratory Press http://creativecommons.org/licenses/by-nc/4.0/ This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see http://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.
spellingShingle Research
Breitwieser, Florian P.
Pertea, Mihaela
Zimin, Aleksey V.
Salzberg, Steven L.
Human contamination in bacterial genomes has created thousands of spurious proteins
title Human contamination in bacterial genomes has created thousands of spurious proteins
title_full Human contamination in bacterial genomes has created thousands of spurious proteins
title_fullStr Human contamination in bacterial genomes has created thousands of spurious proteins
title_full_unstemmed Human contamination in bacterial genomes has created thousands of spurious proteins
title_short Human contamination in bacterial genomes has created thousands of spurious proteins
title_sort human contamination in bacterial genomes has created thousands of spurious proteins
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6581058/
https://www.ncbi.nlm.nih.gov/pubmed/31064768
http://dx.doi.org/10.1101/gr.245373.118
work_keys_str_mv AT breitwieserflorianp humancontaminationinbacterialgenomeshascreatedthousandsofspuriousproteins
AT perteamihaela humancontaminationinbacterialgenomeshascreatedthousandsofspuriousproteins
AT ziminalekseyv humancontaminationinbacterialgenomeshascreatedthousandsofspuriousproteins
AT salzbergstevenl humancontaminationinbacterialgenomeshascreatedthousandsofspuriousproteins