
Mining protein loops using a structural alphabet and statistical exceptionality

BACKGROUND: Protein loops encompass 50% of protein residues in available three-dimensional structures. These regions are often involved in protein functions, e.g. binding sites or catalytic pockets. However, describing protein loops with conventional tools is a difficult task. Regular secondary...


Bibliographic Details
Main authors: Regad, Leslie; Martin, Juliette; Nuel, Gregory; Camproux, Anne-Claude
Format: Text
Language: English
Published: BioMed Central 2010
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2833150/
https://www.ncbi.nlm.nih.gov/pubmed/20132552
http://dx.doi.org/10.1186/1471-2105-11-75
_version_ 1782178359767203840
author Regad, Leslie
Martin, Juliette
Nuel, Gregory
Camproux, Anne-Claude
author_facet Regad, Leslie
Martin, Juliette
Nuel, Gregory
Camproux, Anne-Claude
author_sort Regad, Leslie
collection PubMed
description BACKGROUND: Protein loops encompass 50% of protein residues in available three-dimensional structures. These regions are often involved in protein functions, e.g. binding sites or catalytic pockets. However, describing protein loops with conventional tools is a difficult task. Regular secondary structures, helices and strands, have been widely studied, whereas loops, because they are highly variable in terms of sequence and structure, are difficult to analyze. Due to data sparsity, long loops have rarely been systematically studied. RESULTS: We developed a simple and accurate method that allows the description and analysis of the structures of short and long loops using structural motifs, without restriction on loop length. This method is based on the structural alphabet HMM-SA. HMM-SA simplifies a three-dimensional protein structure into a one-dimensional string of states, where each state is a four-residue prototype fragment, called a structural letter. The difficult task of structurally grouping huge data sets is thus easily accomplished by handling structural-letter strings as in conventional protein sequence analysis. We systematically extracted all seven-residue fragments in a bank of 93000 protein loops and grouped them according to their structural-letter sequence, called a structural word. This approach permits a systematic analysis of loops of all sizes, since we consider structural motifs of seven residues rather than complete loops. We focused the analysis on highly recurrent words of loops (observed more than 30 times). Our study reveals that 73% of loop lengths are covered by only 3310 highly recurrent structural words (out of 28274 observed words). These structural words have low structural variability (mean RMSd of 0.85 Å). As expected, half of these motifs display a flanking-region preference but, interestingly, two thirds are shared by short (less than 12 residues) and long loops. Moreover, half of the recurrent motifs exhibit a significant level of amino-acid conservation, with at least four significant positions, and 87% of long loops contain at least one such word. We complement our analysis with the detection of statistically over-represented patterns of structural letters, as in conventional DNA sequence analysis. About 30% (930) of structural words are over-represented, and cover about 40% of loop lengths. Interestingly, these words exhibit lower structural variability and higher sequential specificity, suggesting structural or functional constraints. CONCLUSIONS: We developed a method to systematically decompose and study protein loops using recurrent structural motifs. This method is based on the structural alphabet HMM-SA and not on structural alignment and geometrical parameters. We extracted meaningful structural motifs that are found in both short and long loops. To our knowledge, this is the first time that pattern mining has been used to increase the signal-to-noise ratio in protein loops. This finding helps to better describe protein loops and might reduce the complexity of long-loop analysis. Detailed results are available at http://www.mti.univ-paris-diderot.fr/publication/supplementary/2009/ACCLoop/.
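The word-extraction and recurrence steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that loops are already encoded as strings of structural letters and that consecutive four-residue letters overlap by three residues, so a seven-residue fragment corresponds to a word of four letters. The function names, the toy loop strings, and the letter-based coverage measure are hypothetical; the abstract reports coverage in terms of loop lengths and uses a recurrence threshold of 30.

```python
from collections import Counter

# Assumption: four overlapping four-residue letters span seven residues.
WORD_LEN = 4


def extract_words(loop, word_len=WORD_LEN):
    """Return every overlapping word of word_len letters in one loop string."""
    return [loop[i:i + word_len] for i in range(len(loop) - word_len + 1)]


def count_words(loops, word_len=WORD_LEN):
    """Count how often each structural word occurs across all loops."""
    counts = Counter()
    for loop in loops:
        counts.update(extract_words(loop, word_len))
    return counts


def recurrent_words(counts, min_count):
    """Keep the words observed strictly more than min_count times."""
    return {word for word, n in counts.items() if n > min_count}


def coverage(loops, words, word_len=WORD_LEN):
    """Fraction of letter positions lying inside at least one selected word."""
    covered = 0
    total = 0
    for loop in loops:
        mask = [False] * len(loop)
        for i in range(len(loop) - word_len + 1):
            if loop[i:i + word_len] in words:
                for j in range(i, i + word_len):
                    mask[j] = True
        covered += sum(mask)
        total += len(loop)
    return covered / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical toy loops; real input would be HMM-SA encoded loops.
    toy_loops = ["ABCDE", "ABCD", "XABCDY"]
    counts = count_words(toy_loops)
    frequent = recurrent_words(counts, min_count=2)
    print(counts.most_common())
    print(frequent, coverage(toy_loops, frequent))
```

Because words are taken with a sliding window, loops of any length contribute fragments of the same size, which is what lets short and long loops be compared through shared words.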
format Text
id pubmed-2833150
institution National Center for Biotechnology Information
language English
publishDate 2010
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2833150 2010-03-06 Mining protein loops using a structural alphabet and statistical exceptionality Regad, Leslie Martin, Juliette Nuel, Gregory Camproux, Anne-Claude BMC Bioinformatics Research article BioMed Central 2010-02-04 /pmc/articles/PMC2833150/ /pubmed/20132552 http://dx.doi.org/10.1186/1471-2105-11-75 Text en Copyright ©2010 Regad et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research article
Regad, Leslie
Martin, Juliette
Nuel, Gregory
Camproux, Anne-Claude
Mining protein loops using a structural alphabet and statistical exceptionality
title Mining protein loops using a structural alphabet and statistical exceptionality
title_full Mining protein loops using a structural alphabet and statistical exceptionality
title_fullStr Mining protein loops using a structural alphabet and statistical exceptionality
title_full_unstemmed Mining protein loops using a structural alphabet and statistical exceptionality
title_short Mining protein loops using a structural alphabet and statistical exceptionality
title_sort mining protein loops using a structural alphabet and statistical exceptionality
topic Research article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2833150/
https://www.ncbi.nlm.nih.gov/pubmed/20132552
http://dx.doi.org/10.1186/1471-2105-11-75
work_keys_str_mv AT regadleslie miningproteinloopsusingastructuralalphabetandstatisticalexceptionality
AT martinjuliette miningproteinloopsusingastructuralalphabetandstatisticalexceptionality
AT nuelgregory miningproteinloopsusingastructuralalphabetandstatisticalexceptionality
AT camprouxanneclaude miningproteinloopsusingastructuralalphabetandstatisticalexceptionality