
CRISPR sequences are sometimes erroneously translated and can contaminate public databases with spurious proteins containing spaced repeats

The genomics era is resulting in the generation of a plethora of biological sequences that are usually stored in public databases. There are many computational tools that facilitate the annotation of these sequences, but sometimes they produce mistakes that enter the databases and can be propagated...


Bibliographic Details
Main authors: Rubio, Alejandro; Mier, Pablo; Andrade-Navarro, Miguel A; Garzón, Andrés; Jiménez, Juan; Pérez-Pulido, Antonio J
Format: Online Article Text
Language: English
Published: Oxford University Press 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7673337/
https://www.ncbi.nlm.nih.gov/pubmed/33206958
http://dx.doi.org/10.1093/database/baaa088
author Rubio, Alejandro
Mier, Pablo
Andrade-Navarro, Miguel A
Garzón, Andrés
Jiménez, Juan
Pérez-Pulido, Antonio J
author_facet Rubio, Alejandro
Mier, Pablo
Andrade-Navarro, Miguel A
Garzón, Andrés
Jiménez, Juan
Pérez-Pulido, Antonio J
author_sort Rubio, Alejandro
collection PubMed
description The genomics era is resulting in the generation of a plethora of biological sequences that are usually stored in public databases. There are many computational tools that facilitate the annotation of these sequences, but sometimes they produce mistakes that enter the databases and can be propagated when erroneous data are used for secondary analyses, such as gene prediction or homology searching. While developing a computational gene finder based on protein-coding sequences, we discovered that the reference UniProtKB protein database is contaminated with some spurious sequences translated from DNA containing clustered regularly interspaced short palindromic repeats. We therefore encourage developers of prokaryotic computational gene finders and protein database curators to consider this source of error.
format Online
Article
Text
id pubmed-7673337
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-76733372020-11-24 CRISPR sequences are sometimes erroneously translated and can contaminate public databases with spurious proteins containing spaced repeats Rubio, Alejandro Mier, Pablo Andrade-Navarro, Miguel A Garzón, Andrés Jiménez, Juan Pérez-Pulido, Antonio J Database (Oxford) Original Article The genomics era is resulting in the generation of a plethora of biological sequences that are usually stored in public databases. There are many computational tools that facilitate the annotation of these sequences, but sometimes they produce mistakes that enter the databases and can be propagated when erroneous data are used for secondary analyses, such as gene prediction or homology searching. While developing a computational gene finder based on protein-coding sequences, we discovered that the reference UniProtKB protein database is contaminated with some spurious sequences translated from DNA containing clustered regularly interspaced short palindromic repeats. We therefore encourage developers of prokaryotic computational gene finders and protein database curators to consider this source of error. Oxford University Press 2020-11-18 /pmc/articles/PMC7673337/ /pubmed/33206958 http://dx.doi.org/10.1093/database/baaa088 Text en © The Author(s) 2020. Published by Oxford University Press. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Original Article
Rubio, Alejandro
Mier, Pablo
Andrade-Navarro, Miguel A
Garzón, Andrés
Jiménez, Juan
Pérez-Pulido, Antonio J
CRISPR sequences are sometimes erroneously translated and can contaminate public databases with spurious proteins containing spaced repeats
title CRISPR sequences are sometimes erroneously translated and can contaminate public databases with spurious proteins containing spaced repeats
title_full CRISPR sequences are sometimes erroneously translated and can contaminate public databases with spurious proteins containing spaced repeats
title_fullStr CRISPR sequences are sometimes erroneously translated and can contaminate public databases with spurious proteins containing spaced repeats
title_full_unstemmed CRISPR sequences are sometimes erroneously translated and can contaminate public databases with spurious proteins containing spaced repeats
title_short CRISPR sequences are sometimes erroneously translated and can contaminate public databases with spurious proteins containing spaced repeats
title_sort crispr sequences are sometimes erroneously translated and can contaminate public databases with spurious proteins containing spaced repeats
topic Original Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7673337/
https://www.ncbi.nlm.nih.gov/pubmed/33206958
http://dx.doi.org/10.1093/database/baaa088