Matching curated genome databases: a non trivial task

BACKGROUND: Curated databases of completely sequenced genomes have been designed independently at the NCBI (RefSeq) and EBI (Genome Reviews) to cope with non-standard annotation found in the version of the sequenced genome that has been published by databanks GenBank/EMBL/DDBJ. These curation attemp...

Bibliographic Details
Main authors: Descorps-Declère, Stéphane, Barba, Matthieu, Labedan, Bernard
Format: Text
Language: English
Published: BioMed Central 2008
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2596144/
https://www.ncbi.nlm.nih.gov/pubmed/18950477
http://dx.doi.org/10.1186/1471-2164-9-501
_version_ 1782161829140627456
author Descorps-Declère, Stéphane
Barba, Matthieu
Labedan, Bernard
collection PubMed
description BACKGROUND: Curated databases of completely sequenced genomes have been designed independently at the NCBI (RefSeq) and EBI (Genome Reviews) to cope with non-standard annotation found in the version of the sequenced genome that has been published by the GenBank/EMBL/DDBJ databanks. These curation attempts were expected to review the annotations and to improve their pertinence when using them to annotate newly released genome sequences by homology to previously annotated genomes. However, we observed that such an uncoordinated effort has two unwanted consequences. First, it is not trivial to map the protein identifiers of the same sequence in both databases. Secondly, the two reannotated versions of the same genome differ at the level of their structural annotation.
RESULTS: Here, we propose CorBank, a program devised to cross-reference protein identifiers whatever the level of identity found between their matching sequences. Approximately 98% of the 1,983,258 amino acid sequences match, allowing instantaneous retrieval of their respective cross-references. CorBank also detects any differences between the independently curated versions of the same genome. We found that the RefSeq and Genome Reviews versions match perfectly for only 50 of the 641 complete genomes we have analyzed. In all other cases there are differences at the level of the coding sequence (CDS), and/or in the total number of CDS in the respective versions of the same genome. CorBank is freely accessible at . The CorBank site also publishes updated, exhaustive results obtained by comparing the RefSeq and Genome Reviews versions of each genome. Accordingly, this web site allows easy searching of cross-references between RefSeq, Genome Reviews, and UniProt, for either a single CDS or a whole replicon.
CONCLUSION: CorBank is very efficient at rapidly detecting the numerous differences existing between the RefSeq and Genome Reviews versions of the same curated genome. Although such differences are acceptable as reflecting different views, we suggest that curators of both genome databases could help reduce further divergence by agreeing on a minimal dialogue and attempting to publish the point of view of the other database whenever it is technically possible.
format Text
id pubmed-2596144
institution National Center for Biotechnology Information
language English
publishDate 2008
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2596144 2008-12-05 Matching curated genome databases: a non trivial task Descorps-Declère, Stéphane Barba, Matthieu Labedan, Bernard BMC Genomics Research Article BioMed Central 2008-10-24 /pmc/articles/PMC2596144/ /pubmed/18950477 http://dx.doi.org/10.1186/1471-2164-9-501 Text en Copyright © 2008 Descorps-Declère et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Matching curated genome databases: a non trivial task
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2596144/
https://www.ncbi.nlm.nih.gov/pubmed/18950477
http://dx.doi.org/10.1186/1471-2164-9-501