
Published and Perished? The Influence of the Searched Protein Database on the Long-Term Storage of Proteomics Data

In proteomics, protein identifications are reported and stored using an unstable reference system: protein identifiers. These proprietary identifiers are created individually by every protein database and can change or may even be deleted over time. To estimate the effect of the searched protein seq...


Bibliographic Details
Main Authors:	Griss, Johannes, Côté, Richard G., Gerner, Christopher, Hermjakob, Henning, Vizcaíno, Juan Antonio
Format:	Online Article Text
Language:	English
Published:	The American Society for Biochemistry and Molecular Biology 2011
Subjects:	Technological Innovation and Resources
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3186200/ https://www.ncbi.nlm.nih.gov/pubmed/21700957 http://dx.doi.org/10.1074/mcp.M111.008490

author	Griss, Johannes Côté, Richard G. Gerner, Christopher Hermjakob, Henning Vizcaíno, Juan Antonio
collection	PubMed
description	In proteomics, protein identifications are reported and stored using an unstable reference system: protein identifiers. These proprietary identifiers are created individually by every protein database and can change or may even be deleted over time. To estimate the effect of the searched protein sequence database on the long-term storage of proteomics data, we analyzed the changes of reported protein identifiers from all public experiments in the Proteomics Identifications (PRIDE) database as of November 2010. To map the submitted protein identifier to a currently active entry, two distinct approaches were used. The first approach used the Protein Identifier Cross Referencing (PICR) service at the EBI, which maps protein identifiers based on 100% sequence identity. The second one (called logical mapping algorithm) accessed the source databases and retrieved the current status of the reported identifier. Our analysis showed the differences between the main protein databases (International Protein Index (IPI), UniProt Knowledgebase (UniProtKB), National Center for Biotechnology Information nr database (NCBI nr), and Ensembl) with respect to identifier stability. For example, whereas 20% of submitted IPI entries were deleted after two years, virtually all UniProtKB entries remained either active or replaced. Furthermore, the two mapping algorithms produced markedly different results. For example, the PICR service reported 10% more IPI entries deleted compared with the logical mapping algorithm. We found several cases where experiments contained more than 10% deleted identifiers already at the time of publication. We also assessed the proportion of peptide identifications in these data sets that still fitted the originally identified protein sequences. Finally, we performed the same overall analysis on all records from IPI, Ensembl, and UniProtKB: two releases per year were used, from 2005. This analysis showed for the first time the true effect of changing protein identifiers on proteomics data. Based on these findings, UniProtKB seems the best database for applications that rely on the long-term storage of proteomics data.
format	Online Article Text
id	pubmed-3186200
institution	National Center for Biotechnology Information
language	English
publishDate	2011
publisher	The American Society for Biochemistry and Molecular Biology
record_format	MEDLINE/PubMed
spelling	pubmed-3186200 2011-11-14 Published and Perished? The Influence of the Searched Protein Database on the Long-Term Storage of Proteomics Data Griss, Johannes Côté, Richard G. Gerner, Christopher Hermjakob, Henning Vizcaíno, Juan Antonio Mol Cell Proteomics Technological Innovation and Resources The American Society for Biochemistry and Molecular Biology 2011-09 2011-06-23 /pmc/articles/PMC3186200/ /pubmed/21700957 http://dx.doi.org/10.1074/mcp.M111.008490 Text en © 2011 by The American Society for Biochemistry and Molecular Biology, Inc. Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) applies to Author Choice Articles
title	Published and Perished? The Influence of the Searched Protein Database on the Long-Term Storage of Proteomics Data
topic	Technological Innovation and Resources
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3186200/ https://www.ncbi.nlm.nih.gov/pubmed/21700957 http://dx.doi.org/10.1074/mcp.M111.008490

