CRISPR sequences are sometimes erroneously translated and can contaminate public databases with spurious proteins containing spaced repeats

The genomics era is resulting in the generation of a plethora of biological sequences that are usually stored in public databases. There are many computational tools that facilitate the annotation of these sequences, but sometimes they produce mistakes that enter the databases and can be propagated...

Bibliographic Details
Main Authors:	Rubio, Alejandro; Mier, Pablo; Andrade-Navarro, Miguel A; Garzón, Andrés; Jiménez, Juan; Pérez-Pulido, Antonio J
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2020
Subjects:	Original Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7673337/ https://www.ncbi.nlm.nih.gov/pubmed/33206958 http://dx.doi.org/10.1093/database/baaa088

collection	PubMed
description	The genomics era is resulting in the generation of a plethora of biological sequences that are usually stored in public databases. There are many computational tools that facilitate the annotation of these sequences, but sometimes they produce mistakes that enter the databases and can be propagated when erroneous data are used for secondary analyses, such as gene prediction or homology searching. While developing a computational gene finder based on protein-coding sequences, we discovered that the reference UniProtKB protein database is contaminated with some spurious sequences translated from DNA containing clustered regularly interspaced short palindromic repeats. We therefore encourage developers of prokaryotic computational gene finders and protein database curators to consider this source of error.
format	Online Article Text
id	pubmed-7673337
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-7673337 2020-11-24 Database (Oxford) Original Article Oxford University Press 2020-11-18 /pmc/articles/PMC7673337/ /pubmed/33206958 http://dx.doi.org/10.1093/database/baaa088 Text en © The Author(s) 2020. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title	CRISPR sequences are sometimes erroneously translated and can contaminate public databases with spurious proteins containing spaced repeats
topic	Original Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7673337/ https://www.ncbi.nlm.nih.gov/pubmed/33206958 http://dx.doi.org/10.1093/database/baaa088
