A stochastic context free grammar based framework for analysis of protein sequences

BACKGROUND: In the last decade, there have been many applications of formal language theory in bioinformatics such as RNA structure prediction and detection of patterns in DNA. However, in the field of proteomics, the size of the protein alphabet and the complexity of relationship between amino acid...

Bibliographic Details
Main Authors:	Dyrka, Witold, Nebel, Jean-Christophe
Format:	Text
Language:	English
Published:	BioMed Central 2009
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2765975/ https://www.ncbi.nlm.nih.gov/pubmed/19814800 http://dx.doi.org/10.1186/1471-2105-10-323

_version_	1782173185474560000
author	Dyrka, Witold Nebel, Jean-Christophe
author_sort	Dyrka, Witold
collection	PubMed
description	BACKGROUND: In the last decade, there have been many applications of formal language theory in bioinformatics, such as RNA structure prediction and detection of patterns in DNA. However, in the field of proteomics, the size of the protein alphabet and the complexity of the relationships between amino acids have mainly limited the application of formal language theory to the production of grammars whose expressive power is no higher than that of stochastic regular grammars. These grammars, like other state-of-the-art methods, cannot cover higher-order dependencies such as the nested and crossing relationships that are common in proteins. To overcome some of these limitations, we propose a Stochastic Context Free Grammar based framework for the analysis of protein sequences in which grammars are induced using a genetic algorithm. RESULTS: This framework was implemented in a system aimed at the production of binding site descriptors. These descriptors not only allow detection of the protein regions involved in these sites, but also provide insight into their structure. Grammars were induced using quantitative properties of amino acids to deal with the size of the protein alphabet. Moreover, we imposed some structural constraints on grammars to reduce the extent of the rule search space. Finally, grammars based on different properties were combined to convey as much information as possible. Evaluation was performed on sites of various sizes and complexity described by PROSITE patterns, domain profiles or a set of patterns. Results show that the produced binding site descriptors are human-readable and, hence, highlight biologically meaningful features. Moreover, they achieve good accuracy in both annotation and detection. In addition, findings suggest that, unlike current state-of-the-art methods, our system may be particularly suited to deal with patterns shared by non-homologous proteins.
CONCLUSION: A new Stochastic Context Free Grammar based framework has been introduced, allowing the production of binding site descriptors for the analysis of protein sequences. Experiments have shown that this new approach is not only valid, but also produces human-readable descriptors for binding sites which have been beyond the capability of current machine learning techniques.
format	Text
id	pubmed-2765975
institution	National Center for Biotechnology Information
language	English
publishDate	2009
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-2765975 2009-10-23 A stochastic context free grammar based framework for analysis of protein sequences Dyrka, Witold Nebel, Jean-Christophe BMC Bioinformatics Research Article BioMed Central 2009-10-08 /pmc/articles/PMC2765975/ /pubmed/19814800 http://dx.doi.org/10.1186/1471-2105-10-323 Text en Copyright © 2009 Dyrka and Nebel; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Research Article Dyrka, Witold Nebel, Jean-Christophe A stochastic context free grammar based framework for analysis of protein sequences
title	A stochastic context free grammar based framework for analysis of protein sequences
title_sort	stochastic context free grammar based framework for analysis of protein sequences
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2765975/ https://www.ncbi.nlm.nih.gov/pubmed/19814800 http://dx.doi.org/10.1186/1471-2105-10-323
