Exploring general-purpose protein features for distinguishing enzymes and non-enzymes within the twilight zone

BACKGROUND: Computational prediction of protein function is one of the more complex problems in bioinformatics because of the diversity of functions and mechanisms that proteins exert in nature. The problem is especially acute for proteins that share very low primary or tertiary str...

Bibliographic Details
Main Authors: Ruiz-Blanco, Yasser B., Agüero-Chapin, Guillermin, García-Hernández, Enrique, Álvarez, Orlando, Antunes, Agostinho, Green, James
Format: Online Article Text
Language: English
Published: BioMed Central 2017
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5521120/
https://www.ncbi.nlm.nih.gov/pubmed/28732462
http://dx.doi.org/10.1186/s12859-017-1758-x
author Ruiz-Blanco, Yasser B.
Agüero-Chapin, Guillermin
García-Hernández, Enrique
Álvarez, Orlando
Antunes, Agostinho
Green, James
collection PubMed
description BACKGROUND: Computational prediction of protein function is one of the more complex problems in bioinformatics because of the diversity of functions and mechanisms that proteins exert in nature. The problem is especially acute for proteins that share very low primary or tertiary structure similarity with existing annotated proteomes. New alignment-free (AF) tools are therefore needed to overcome the inherent limitations of classic alignment-based approaches. We have recently introduced AF protein-numerical-encoding programs (TI2BioP and ProtDCal), whose sequence-based features have been successfully applied to detect remote protein homologs, post-translational modifications and antibacterial peptides. Here we aim to demonstrate the applicability of four AF protein descriptor families, implemented in our programs, to the identification of enzyme-like proteins. Our novel family of 3D–structure-based descriptors is also introduced here for the first time. The Dobson & Doig (D&D) benchmark dataset is used to evaluate our AF protein descriptors because its proven structural diversity makes it possible to emulate an experiment within the twilight zone of alignment-based methods (pair-wise identity <30%). The performance of our sequence-based predictor was further assessed using a subset of formerly uncharacterized proteins that currently serves as a benchmark annotation dataset. RESULTS: Four protein descriptor families, namely sequence-composition-based (0D), linear-topology-based (1D), pseudo-fold-topology-based (2D) and 3D–structure (3D) features, were assessed using the D&D benchmark dataset. We show that only the families of ProtDCal’s descriptors (0D, 1D and 3D) encode significant information for discriminating enzymes from non-enzymes. The resulting 3D–structure-based classifier ranked first among several other SVM-based methods assessed on this dataset. Furthermore, the model leveraging 1D descriptors showed a higher success rate than EzyPred on a benchmark annotation dataset from the Shewanella oneidensis proteome. CONCLUSIONS: The applicability of ProtDCal as a general-purpose AF protein modelling method is illustrated through the discrimination between two comprehensive protein functional classes. The performance observed on the highly diverse D&D dataset and on the set of formerly uncharacterized (hard-to-annotate) proteins of Shewanella oneidensis places our methodology in the top range of alignment-free methods for modelling and predicting protein function. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-017-1758-x) contains supplementary material, which is available to authorized users.
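As a rough illustration of what an alignment-free, sequence-composition-based feature can look like, the sketch below computes the plain amino acid composition of a sequence as a 20-value vector. This is an assumption made for illustration only: it is not ProtDCal's or TI2BioP's actual descriptor set, and the function name and toy sequence are hypothetical.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so that every
# sequence maps to a vector with the same layout.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the fraction of each standard residue in the sequence.

    Illustrative stand-in for a composition-based (0D-like) feature;
    not the descriptor definition used in the article.
    """
    seq = sequence.upper()
    # Count only standard residues; ambiguous codes such as X are skipped.
    counts = Counter(ch for ch in seq if ch in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


# Hypothetical toy sequence, used only to show the call.
features = amino_acid_composition("MKTAYIAKQR")
```

Because the vector has a fixed length regardless of sequence length, it can be passed to a classifier such as an SVM without aligning the sequence to any other protein.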
format Online
Article
Text
id pubmed-5521120
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-5521120 2017-07-21 Exploring general-purpose protein features for distinguishing enzymes and non-enzymes within the twilight zone. BMC Bioinformatics. Research Article. BioMed Central 2017-07-21 /pmc/articles/PMC5521120/ /pubmed/28732462 http://dx.doi.org/10.1186/s12859-017-1758-x Text en © The Author(s). 2017 Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Exploring general-purpose protein features for distinguishing enzymes and non-enzymes within the twilight zone
topic Research Article