Minimizing proteome redundancy in the UniProt Knowledgebase

Advances in high-throughput sequencing have led to an unprecedented growth in genome sequences being submitted to biological databases. In particular, the sequencing of large numbers of nearly identical bacterial genomes during infection outbreaks and for other large-scale studies has resulted in a...

Bibliographic Details
Main Authors: Bursteinas, Borisas; Britto, Ramona; Bely, Benoit; Auchincloss, Andrea; Rivoire, Catherine; Redaschi, Nicole; O'Donovan, Claire; Martin, Maria Jesus
Format: Online Article Text
Language: English
Published: Oxford University Press 2016
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5199198/
https://www.ncbi.nlm.nih.gov/pubmed/28025334
http://dx.doi.org/10.1093/database/baw139
author Bursteinas, Borisas
Britto, Ramona
Bely, Benoit
Auchincloss, Andrea
Rivoire, Catherine
Redaschi, Nicole
O'Donovan, Claire
Martin, Maria Jesus
collection PubMed
description Advances in high-throughput sequencing have led to an unprecedented growth in genome sequences being submitted to biological databases. In particular, the sequencing of large numbers of nearly identical bacterial genomes during infection outbreaks and for other large-scale studies has resulted in a high level of redundancy in nucleotide databases and consequently in the UniProt Knowledgebase (UniProtKB). Redundancy negatively affects database searches by causing slower searches, an increase in statistical bias and cumbersome result analysis. The redundancy combined with the large data volume increases the computational costs for most reuses of UniProtKB data. All of this poses challenges for effective discovery in this wealth of data. With the continuing development of sequencing technologies, it is clear that finding ways to minimize redundancy is crucial to maintaining UniProt's essential contribution to data interpretation by our users. We have developed a methodology to identify and remove highly redundant proteomes from UniProtKB. The procedure identifies redundant proteomes by performing pairwise alignments of sets of sequences for pairs of proteomes and subsequently applies graph theory to find dominating sets that provide a set of non-redundant proteomes with a minimal loss of information. This method was implemented for bacteria in mid-2015, resulting in the removal of 50 million proteins from UniProtKB. With every new release, this procedure is used to filter new incoming proteomes, resulting in a more scalable and scientifically valuable growth of UniProtKB. Database URL: http://www.uniprot.org/proteomes/
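The dominating-set step mentioned in the description can be illustrated with a minimal sketch. This is not the authors' implementation: the record does not state which similarity measure, threshold, or dominating-set algorithm they used. The sketch assumes that pairwise proteome similarity scores are already available, links two proteomes when their score reaches a hypothetical threshold, and uses a simple greedy heuristic (finding a minimum dominating set exactly is NP-hard). The proteome names, scores, and the 0.95 cut-off are all invented for illustration.

```python
from typing import Dict, Iterable, Set, Tuple


def build_redundancy_graph(
    proteomes: Iterable[str],
    similarity: Dict[Tuple[str, str], float],
    threshold: float,
) -> Dict[str, Set[str]]:
    """Link two proteomes when their similarity score reaches the threshold."""
    graph: Dict[str, Set[str]] = {p: set() for p in proteomes}
    for (a, b), score in similarity.items():
        if score >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def greedy_dominating_set(graph: Dict[str, Set[str]]) -> Set[str]:
    """Greedy heuristic: repeatedly keep the proteome that covers the most
    still-uncovered proteomes (itself plus its redundant neighbours)."""
    undominated = set(graph)
    chosen: Set[str] = set()
    while undominated:
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(
            sorted(graph),
            key=lambda n: len(({n} | graph[n]) & undominated),
        )
        chosen.add(best)
        undominated -= {best} | graph[best]
    return chosen


# Hypothetical data: six proteomes and a few pairwise similarity scores.
proteomes = ["A", "B", "C", "D", "E", "F"]
similarity = {
    ("A", "B"): 0.99,
    ("A", "C"): 0.98,
    ("B", "C"): 0.97,
    ("D", "E"): 0.99,
    ("A", "D"): 0.40,  # below the cut-off, so no edge
}
graph = build_redundancy_graph(proteomes, similarity, threshold=0.95)
representatives = greedy_dominating_set(graph)
print(sorted(representatives))  # ['A', 'D', 'F']
```

Every proteome outside the returned set is redundant with at least one proteome inside it, which is the property that lets the redundant ones be dropped with little loss of information. The greedy choice gives a small dominating set but not necessarily the smallest one.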
format Online
Article
Text
id pubmed-5199198
institution National Center for Biotechnology Information
language English
publishDate 2016
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-5199198 2017-01-06 Minimizing proteome redundancy in the UniProt Knowledgebase Bursteinas, Borisas; Britto, Ramona; Bely, Benoit; Auchincloss, Andrea; Rivoire, Catherine; Redaschi, Nicole; O'Donovan, Claire; Martin, Maria Jesus Database (Oxford) Original Article
Database URL: http://www.uniprot.org/proteomes/ Oxford University Press 2016-12-26 /pmc/articles/PMC5199198/ /pubmed/28025334 http://dx.doi.org/10.1093/database/baw139 Text en © The Author(s) 2016. Published by Oxford University Press. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Minimizing proteome redundancy in the UniProt Knowledgebase
topic Original Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5199198/
https://www.ncbi.nlm.nih.gov/pubmed/28025334
http://dx.doi.org/10.1093/database/baw139