ProtDCal: A program to compute general-purpose-numerical descriptors for sequences and 3D-structures of proteins
BACKGROUND: The exponential growth of protein structural and sequence databases is enabling multifaceted approaches to understanding the long sought sequence-structure-function relationship. Advances in computation now make it possible to apply well-established data mining and pattern recognition te...
Main authors: | Ruiz-Blanco, Yasser B; Paz, Waldo; Green, James; Marrero-Ponce, Yovani |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2015 |
Subjects: | Software |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4432771/ https://www.ncbi.nlm.nih.gov/pubmed/25982853 http://dx.doi.org/10.1186/s12859-015-0586-0 |
_version_ | 1782371532285149184 |
---|---|
author | Ruiz-Blanco, Yasser B Paz, Waldo Green, James Marrero-Ponce, Yovani |
author_facet | Ruiz-Blanco, Yasser B Paz, Waldo Green, James Marrero-Ponce, Yovani |
author_sort | Ruiz-Blanco, Yasser B |
collection | PubMed |
description | BACKGROUND: The exponential growth of protein structural and sequence databases is enabling multifaceted approaches to understanding the long sought sequence-structure-function relationship. Advances in computation now make it possible to apply well-established data mining and pattern recognition techniques to these data to learn models that effectively relate structure and function. However, extracting meaningful numerical descriptors of protein sequence and structure is a key issue that requires an efficient and widely available solution. RESULTS: We here introduce ProtDCal, a new computational software suite capable of generating tens of thousands of features considering both sequence-based and 3D-structural descriptors. We demonstrate, by means of principal component analysis and Shannon entropy tests, how ProtDCal’s sequence-based descriptors provide new and more relevant information not encoded by currently available servers for sequence-based protein feature generation. The wide diversity of the 3D-structure-based features generated by ProtDCal is shown to provide additional complementary information and effectively completes its general protein encoding capability. As a demonstration of the utility of ProtDCal’s features, prediction models of N-linked glycosylation sites are trained and evaluated. Classification performance compares favourably with that of contemporary predictors of N-linked glycosylation sites, in spite of not using domain-specific features as input information. CONCLUSIONS: ProtDCal provides a friendly and cross-platform graphical user interface, developed in the Java programming language and is freely available at: http://bioinf.sce.carleton.ca/ProtDCal/. ProtDCal introduces local and group-based encoding which enhances the diversity of the information captured by the computed features. 
Furthermore, we have shown that adding structure-based descriptors contributes non-redundant additional information to the features-based characterization of polypeptide systems. This software is intended to provide a useful tool for general-purpose encoding of protein sequences and structures for applications in protein classification, similarity analyses and function prediction. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-015-0586-0) contains supplementary material, which is available to authorized users. |
format | Online Article Text |
id | pubmed-4432771 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2015 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-4432771 2015-05-16 ProtDCal: A program to compute general-purpose-numerical descriptors for sequences and 3D-structures of proteins Ruiz-Blanco, Yasser B Paz, Waldo Green, James Marrero-Ponce, Yovani BMC Bioinformatics Software BACKGROUND: The exponential growth of protein structural and sequence databases is enabling multifaceted approaches to understanding the long sought sequence-structure-function relationship. Advances in computation now make it possible to apply well-established data mining and pattern recognition techniques to these data to learn models that effectively relate structure and function. However, extracting meaningful numerical descriptors of protein sequence and structure is a key issue that requires an efficient and widely available solution. RESULTS: We here introduce ProtDCal, a new computational software suite capable of generating tens of thousands of features considering both sequence-based and 3D-structural descriptors. We demonstrate, by means of principal component analysis and Shannon entropy tests, how ProtDCal’s sequence-based descriptors provide new and more relevant information not encoded by currently available servers for sequence-based protein feature generation. The wide diversity of the 3D-structure-based features generated by ProtDCal is shown to provide additional complementary information and effectively completes its general protein encoding capability. As a demonstration of the utility of ProtDCal’s features, prediction models of N-linked glycosylation sites are trained and evaluated. Classification performance compares favourably with that of contemporary predictors of N-linked glycosylation sites, in spite of not using domain-specific features as input information. CONCLUSIONS: ProtDCal provides a friendly and cross-platform graphical user interface, developed in the Java programming language and is freely available at: http://bioinf.sce.carleton.ca/ProtDCal/. 
ProtDCal introduces local and group-based encoding which enhances the diversity of the information captured by the computed features. Furthermore, we have shown that adding structure-based descriptors contributes non-redundant additional information to the features-based characterization of polypeptide systems. This software is intended to provide a useful tool for general-purpose encoding of protein sequences and structures for applications in protein classification, similarity analyses and function prediction. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-015-0586-0) contains supplementary material, which is available to authorized users. BioMed Central 2015-05-16 /pmc/articles/PMC4432771/ /pubmed/25982853 http://dx.doi.org/10.1186/s12859-015-0586-0 Text en © Ruiz-Blanco et al.; licensee BioMed Central. 2015 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Software Ruiz-Blanco, Yasser B Paz, Waldo Green, James Marrero-Ponce, Yovani ProtDCal: A program to compute general-purpose-numerical descriptors for sequences and 3D-structures of proteins |
title | ProtDCal: A program to compute general-purpose-numerical descriptors for sequences and 3D-structures of proteins |
title_full | ProtDCal: A program to compute general-purpose-numerical descriptors for sequences and 3D-structures of proteins |
title_fullStr | ProtDCal: A program to compute general-purpose-numerical descriptors for sequences and 3D-structures of proteins |
title_full_unstemmed | ProtDCal: A program to compute general-purpose-numerical descriptors for sequences and 3D-structures of proteins |
title_short | ProtDCal: A program to compute general-purpose-numerical descriptors for sequences and 3D-structures of proteins |
title_sort | protdcal: a program to compute general-purpose-numerical descriptors for sequences and 3d-structures of proteins |
topic | Software |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4432771/ https://www.ncbi.nlm.nih.gov/pubmed/25982853 http://dx.doi.org/10.1186/s12859-015-0586-0 |