
Gene Unprediction with Spurio: A tool to identify spurious protein sequences

We now have access to the sequences of tens of millions of proteins. These protein sequences are essential for modern molecular biology and computational biology. The vast majority of protein sequences are derived from gene prediction tools and have no experimental supporting evidence for their tran...


Bibliographic Details
Main authors:	Höps, Wolfram, Jeffryes, Matt, Bateman, Alex
Format:	Online Article Text
Language:	English
Published:	F1000 Research Limited 2018
Subjects:	Method Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5897793/ https://www.ncbi.nlm.nih.gov/pubmed/29721311 http://dx.doi.org/10.12688/f1000research.14050.1

author	Höps, Wolfram Jeffryes, Matt Bateman, Alex
collection	PubMed
description	We now have access to the sequences of tens of millions of proteins. These protein sequences are essential for modern molecular biology and computational biology. The vast majority of protein sequences are derived from gene prediction tools and have no experimental supporting evidence for their translation. Despite the increasing accuracy of gene prediction tools, a large number of spurious protein predictions likely exist in the sequence databases. We have developed the Spurio tool to help identify spurious protein predictions in prokaryotes. Spurio searches the query protein sequence against a prokaryotic nucleotide database using tblastn and identifies homologous sequences. The tblastn matches are used to score the query sequence’s likelihood of being a spurious protein prediction using a Gaussian process model. The most informative feature is the appearance of stop codons within the presumed translation of homologous DNA sequences. Benchmarking shows that the Spurio tool is able to distinguish spurious from true proteins. However, transposon proteins are prone to be predicted as spurious because of the frequency of degraded homologs found in the DNA sequence databases. Our initial experiments suggest that less than 1% of the proteins in the UniProtKB sequence database are likely to be spurious and that Spurio is able to identify over 60 times more spurious proteins than the AntiFam resource. The Spurio software and source code are available under an MIT license at the following URL: https://bitbucket.org/bateman-group/spurio
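
The description names stop codons in the translated homologous DNA as the most informative feature. The following is a minimal sketch of how such a feature could be computed; it is not Spurio's implementation. It assumes the tblastn matches were exported as tab-separated rows of subject id, e-value, and translated subject segment (for example with BLAST+ `-outfmt "6 sseqid evalue sseq"`), where `*` marks a stop codon. The names `Hit`, `parse_hits`, `stop_codon_features`, and the `max_evalue` cutoff are hypothetical, and the Gaussian process scoring step is not shown.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Hit:
    subject_id: str   # identifier of the matched nucleotide record
    evalue: float     # e-value reported for the match
    translation: str  # translated subject segment; '*' marks a stop codon


def parse_hits(lines: Iterable[str]) -> List[Hit]:
    """Parse tab-separated rows: subject id, e-value, translated segment."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        subject_id, evalue, translation = line.split("\t")
        hits.append(Hit(subject_id, float(evalue), translation))
    return hits


def stop_codon_features(hits: List[Hit], max_evalue: float = 1e-5) -> Dict[str, float]:
    """Summarise stop codons across matches passing an assumed e-value cutoff."""
    kept = [h for h in hits if h.evalue <= max_evalue]
    if not kept:
        return {"n_hits": 0, "frac_hits_with_stop": 0.0, "stops_per_100_positions": 0.0}
    hits_with_stop = sum(1 for h in kept if "*" in h.translation)
    total_stops = sum(h.translation.count("*") for h in kept)
    # Count aligned positions, ignoring alignment gaps ('-').
    total_positions = sum(len(h.translation.replace("-", "")) for h in kept)
    return {
        "n_hits": len(kept),
        "frac_hits_with_stop": hits_with_stop / len(kept),
        "stops_per_100_positions": 100.0 * total_stops / total_positions,
    }


if __name__ == "__main__":
    # Made-up rows for illustration only; not real search results.
    rows = [
        "subj1\t1e-20\tMKT*LLV",
        "subj2\t1e-10\tMKTALLV",
        "subj3\t0.5\tM*T*LLV",
    ]
    print(stop_codon_features(parse_hits(rows)))
```

A high fraction of matches containing stop codons would, under this reading of the description, point toward a spurious prediction. In the tool itself such features feed a trained model rather than a fixed threshold.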
format	Online Article Text
id	pubmed-5897793
institution	National Center for Biotechnology Information
language	English
publishDate	2018
publisher	F1000 Research Limited
record_format	MEDLINE/PubMed
spelling	pubmed-5897793 2018-05-01 Gene Unprediction with Spurio: A tool to identify spurious protein sequences Höps, Wolfram Jeffryes, Matt Bateman, Alex F1000Res Method Article F1000 Research Limited 2018-03-02 /pmc/articles/PMC5897793/ /pubmed/29721311 http://dx.doi.org/10.12688/f1000research.14050.1 Text en Copyright: © 2018 Höps W et al. http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Gene Unprediction with Spurio: A tool to identify spurious protein sequences
topic	Method Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5897793/ https://www.ncbi.nlm.nih.gov/pubmed/29721311 http://dx.doi.org/10.12688/f1000research.14050.1
