
Identification and correction of abnormal, incomplete and mispredicted proteins in public databases


Bibliographic Details
Main authors: Nagy, Alinda, Hegyi, Hédi, Farkas, Krisztina, Tordai, Hedvig, Kozma, Evelin, Bányai, László, Patthy, László
Format: Text
Language: English
Published: BioMed Central 2008
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2542381/
https://www.ncbi.nlm.nih.gov/pubmed/18752676
http://dx.doi.org/10.1186/1471-2105-9-353
_version_ 1782159144702181376
author Nagy, Alinda
Hegyi, Hédi
Farkas, Krisztina
Tordai, Hedvig
Kozma, Evelin
Bányai, László
Patthy, László
author_facet Nagy, Alinda
Hegyi, Hédi
Farkas, Krisztina
Tordai, Hedvig
Kozma, Evelin
Bányai, László
Patthy, László
author_sort Nagy, Alinda
collection PubMed
description BACKGROUND: Despite significant improvements in computational annotation of genomes, sequences of abnormal, incomplete or incorrectly predicted genes and proteins remain abundant in public databases. Since the majority of incomplete, abnormal or mispredicted entries are not annotated as such, these errors seriously affect the reliability of these databases. Here we describe the MisPred approach that may provide an efficient means for the quality control of databases. The current version of the MisPred approach uses five distinct routines for identifying abnormal, incomplete or mispredicted entries based on the principle that a sequence is likely to be incorrect if some of its features conflict with our current knowledge about protein-coding genes and proteins: (i) conflict between the predicted subcellular localization of proteins and the absence of the corresponding sequence signals; (ii) presence of extracellular and cytoplasmic domains and the absence of transmembrane segments; (iii) co-occurrence of extracellular and nuclear domains; (iv) violation of domain integrity; (v) chimeras encoded by two or more genes located on different chromosomes. RESULTS: Analyses of predicted EnsEMBL protein sequences of nine deuterostome (Homo sapiens, Mus musculus, Rattus norvegicus, Monodelphis domestica, Gallus gallus, Xenopus tropicalis, Fugu rubripes, Danio rerio and Ciona intestinalis) and two protostome species (Caenorhabditis elegans and Drosophila melanogaster) have revealed that the absence of expected signal peptides and violation of domain integrity account for the majority of mispredictions. Analyses of sequences predicted by NCBI's GNOMON annotation pipeline show that the rates of mispredictions are comparable to those of EnsEMBL. Interestingly, even the manually curated UniProtKB/Swiss-Prot dataset is contaminated with mispredicted or abnormal proteins, although to a much lesser extent than UniProtKB/TrEMBL or the EnsEMBL or GNOMON-predicted entries. 
CONCLUSION: MisPred works efficiently in identifying errors in predictions generated by the most reliable gene prediction tools such as the EnsEMBL and NCBI's GNOMON pipelines and also guides the correction of errors. We suggest that application of the MisPred approach will significantly improve the quality of gene predictions and the associated databases.
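The five routines listed in the abstract can be read as simple rule checks on an annotated protein entry. What follows is a minimal sketch of that idea, not the MisPred implementation: the `Domain` and `ProteinEntry` classes, the field names, the function `mispred_flags`, the `MIN_COVERAGE` threshold and the exact conditions (in particular, treating a transmembrane segment as an acceptable signal under rule i) are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Domain:
    name: str
    localization: str  # assumed labels: "extracellular", "cytoplasmic", "nuclear"
    coverage: float    # fraction of the domain's expected length that is present


@dataclass
class ProteinEntry:
    entry_id: str
    has_signal_peptide: bool
    n_transmembrane: int
    domains: List[Domain] = field(default_factory=list)
    chromosomes: Set[str] = field(default_factory=set)  # chromosomes of the encoding gene(s)


# Hypothetical cutoff: a domain covering less than this fraction counts as truncated.
MIN_COVERAGE = 0.6


def mispred_flags(p: ProteinEntry) -> List[str]:
    """Return the labels (i-v) of the rules that the entry appears to violate."""
    locs = {d.localization for d in p.domains}
    flags = []
    # (i) extracellular domain, but no sequence signal that would explain it
    if "extracellular" in locs and not p.has_signal_peptide and p.n_transmembrane == 0:
        flags.append("i")
    # (ii) extracellular and cytoplasmic domains without a transmembrane segment
    if {"extracellular", "cytoplasmic"} <= locs and p.n_transmembrane == 0:
        flags.append("ii")
    # (iii) co-occurrence of extracellular and nuclear domains
    if {"extracellular", "nuclear"} <= locs:
        flags.append("iii")
    # (iv) domain integrity: at least one domain is only partially present
    if any(d.coverage < MIN_COVERAGE for d in p.domains):
        flags.append("iv")
    # (v) chimera: parts of the protein map to more than one chromosome
    if len(p.chromosomes) > 1:
        flags.append("v")
    return flags


suspect = ProteinEntry(
    "X1",
    has_signal_peptide=False,
    n_transmembrane=0,
    domains=[Domain("EGF", "extracellular", 1.0), Domain("Homeobox", "nuclear", 0.4)],
    chromosomes={"1", "7"},
)
print(mispred_flags(suspect))
```

An entry that trips any rule is only a candidate for review. In practice the domain, signal peptide and transmembrane annotations would come from external prediction tools, which this sketch does not model.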
format Text
id pubmed-2542381
institution National Center for Biotechnology Information
language English
publishDate 2008
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-25423812008-09-18 Identification and correction of abnormal, incomplete and mispredicted proteins in public databases Nagy, Alinda Hegyi, Hédi Farkas, Krisztina Tordai, Hedvig Kozma, Evelin Bányai, László Patthy, László BMC Bioinformatics Research Article BACKGROUND: Despite significant improvements in computational annotation of genomes, sequences of abnormal, incomplete or incorrectly predicted genes and proteins remain abundant in public databases. Since the majority of incomplete, abnormal or mispredicted entries are not annotated as such, these errors seriously affect the reliability of these databases. Here we describe the MisPred approach that may provide an efficient means for the quality control of databases. The current version of the MisPred approach uses five distinct routines for identifying abnormal, incomplete or mispredicted entries based on the principle that a sequence is likely to be incorrect if some of its features conflict with our current knowledge about protein-coding genes and proteins: (i) conflict between the predicted subcellular localization of proteins and the absence of the corresponding sequence signals; (ii) presence of extracellular and cytoplasmic domains and the absence of transmembrane segments; (iii) co-occurrence of extracellular and nuclear domains; (iv) violation of domain integrity; (v) chimeras encoded by two or more genes located on different chromosomes. RESULTS: Analyses of predicted EnsEMBL protein sequences of nine deuterostome (Homo sapiens, Mus musculus, Rattus norvegicus, Monodelphis domestica, Gallus gallus, Xenopus tropicalis, Fugu rubripes, Danio rerio and Ciona intestinalis) and two protostome species (Caenorhabditis elegans and Drosophila melanogaster) have revealed that the absence of expected signal peptides and violation of domain integrity account for the majority of mispredictions. 
Analyses of sequences predicted by NCBI's GNOMON annotation pipeline show that the rates of mispredictions are comparable to those of EnsEMBL. Interestingly, even the manually curated UniProtKB/Swiss-Prot dataset is contaminated with mispredicted or abnormal proteins, although to a much lesser extent than UniProtKB/TrEMBL or the EnsEMBL or GNOMON-predicted entries. CONCLUSION: MisPred works efficiently in identifying errors in predictions generated by the most reliable gene prediction tools such as the EnsEMBL and NCBI's GNOMON pipelines and also guides the correction of errors. We suggest that application of the MisPred approach will significantly improve the quality of gene predictions and the associated databases. BioMed Central 2008-08-27 /pmc/articles/PMC2542381/ /pubmed/18752676 http://dx.doi.org/10.1186/1471-2105-9-353 Text en Copyright © 2008 Nagy et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research Article
Nagy, Alinda
Hegyi, Hédi
Farkas, Krisztina
Tordai, Hedvig
Kozma, Evelin
Bányai, László
Patthy, László
Identification and correction of abnormal, incomplete and mispredicted proteins in public databases
title Identification and correction of abnormal, incomplete and mispredicted proteins in public databases
title_full Identification and correction of abnormal, incomplete and mispredicted proteins in public databases
title_fullStr Identification and correction of abnormal, incomplete and mispredicted proteins in public databases
title_full_unstemmed Identification and correction of abnormal, incomplete and mispredicted proteins in public databases
title_short Identification and correction of abnormal, incomplete and mispredicted proteins in public databases
title_sort identification and correction of abnormal, incomplete and mispredicted proteins in public databases
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2542381/
https://www.ncbi.nlm.nih.gov/pubmed/18752676
http://dx.doi.org/10.1186/1471-2105-9-353