Information quantity for secondary structure propensities of protein subsequences in the Protein Data Bank
Elucidating the principles of sequence–structure relationships of proteins is a long-standing issue in biology. The nature of a short segment of a protein is determined by both the subsequence of the segment itself and its environment. For example, a type of subsequence, the so-called chameleon sequ...
Main authors: | Kondo, Ryohei; Kasahara, Kota; Takahashi, Takuya |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | The Biophysical Society of Japan, 2022 |
Subjects: | Regular Article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8926306/ https://www.ncbi.nlm.nih.gov/pubmed/35532457 http://dx.doi.org/10.2142/biophysico.bppb-v19.0002 |
_version_ | 1784670213473566720 |
---|---|
author | Kondo, Ryohei Kasahara, Kota Takahashi, Takuya |
author_facet | Kondo, Ryohei Kasahara, Kota Takahashi, Takuya |
author_sort | Kondo, Ryohei |
collection | PubMed |
description | Elucidating the principles of sequence–structure relationships of proteins is a long-standing issue in biology. The nature of a short segment of a protein is determined by both the subsequence of the segment itself and its environment. For example, a type of subsequence, the so-called chameleon sequences, can form different secondary structures depending on its environments. Chameleon sequences are considered to have a weak tendency to form a specific structure. Although many chameleon sequences have been identified, they are only a small part of all possible subsequences in the proteome. The strength of the tendency to take a specific structure for each subsequence has not been fully quantified. In this study, we comprehensively analyzed subsequences consisting of four to nine amino acid residues, or N-gram (4≤N≤9), observed in non-redundant sequences in the Protein Data Bank (PDB). Tendencies to form a specific structure in terms of the secondary structure and accessible surface area are quantified as information quantities for each N-gram. Although the majority of observed subsequences have low information quantity due to lack of samples in the current PDB, thousands of N-grams with strong tendencies, including known structural motifs, were found. In addition, machine learning partially predicted the tendency of unknown N-grams, and thus, this technique helps to extract knowledge from the limited number of samples in the PDB. |
format | Online Article Text |
id | pubmed-8926306 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | The Biophysical Society of Japan |
record_format | MEDLINE/PubMed |
spelling | pubmed-89263062022-05-04 Information quantity for secondary structure propensities of protein subsequences in the Protein Data Bank Kondo, Ryohei Kasahara, Kota Takahashi, Takuya Biophys Physicobiol Regular Article Elucidating the principles of sequence–structure relationships of proteins is a long-standing issue in biology. The nature of a short segment of a protein is determined by both the subsequence of the segment itself and its environment. For example, a type of subsequence, the so-called chameleon sequences, can form different secondary structures depending on its environments. Chameleon sequences are considered to have a weak tendency to form a specific structure. Although many chameleon sequences have been identified, they are only a small part of all possible subsequences in the proteome. The strength of the tendency to take a specific structure for each subsequence has not been fully quantified. In this study, we comprehensively analyzed subsequences consisting of four to nine amino acid residues, or N-gram (4≤N≤9), observed in non-redundant sequences in the Protein Data Bank (PDB). Tendencies to form a specific structure in terms of the secondary structure and accessible surface area are quantified as information quantities for each N-gram. Although the majority of observed subsequences have low information quantity due to lack of samples in the current PDB, thousands of N-grams with strong tendencies, including known structural motifs, were found. In addition, machine learning partially predicted the tendency of unknown N-grams, and thus, this technique helps to extract knowledge from the limited number of samples in the PDB. 
The Biophysical Society of Japan 2022-02-08 /pmc/articles/PMC8926306/ /pubmed/35532457 http://dx.doi.org/10.2142/biophysico.bppb-v19.0002 Text en 2022 THE BIOPHYSICAL SOCIETY OF JAPAN https://creativecommons.org/licenses/by-nc-sa/4.0/ This article is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, visit https://creativecommons.org/licenses/by-nc-sa/4.0/. |
spellingShingle | Regular Article Kondo, Ryohei Kasahara, Kota Takahashi, Takuya Information quantity for secondary structure propensities of protein subsequences in the Protein Data Bank |
title | Information quantity for secondary structure propensities of protein subsequences in the Protein Data Bank |
title_full | Information quantity for secondary structure propensities of protein subsequences in the Protein Data Bank |
title_fullStr | Information quantity for secondary structure propensities of protein subsequences in the Protein Data Bank |
title_full_unstemmed | Information quantity for secondary structure propensities of protein subsequences in the Protein Data Bank |
title_short | Information quantity for secondary structure propensities of protein subsequences in the Protein Data Bank |
title_sort | information quantity for secondary structure propensities of protein subsequences in the protein data bank |
topic | Regular Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8926306/ https://www.ncbi.nlm.nih.gov/pubmed/35532457 http://dx.doi.org/10.2142/biophysico.bppb-v19.0002 |
work_keys_str_mv | AT kondoryohei informationquantityforsecondarystructurepropensitiesofproteinsubsequencesintheproteindatabank AT kasaharakota informationquantityforsecondarystructurepropensitiesofproteinsubsequencesintheproteindatabank AT takahashitakuya informationquantityforsecondarystructurepropensitiesofproteinsubsequencesintheproteindatabank |
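
The description field above says that the tendency of each N-gram to form a specific secondary structure is quantified as an information quantity, but this record does not give the formula. The sketch below is only an illustration under a stated assumption: it treats the information quantity as the Kullback-Leibler divergence (in bits) between the secondary-structure labels observed within an N-gram and the background label distribution over all N-gram windows. The function name `ngram_information`, the one-letter labels `H`/`E`, and the toy sequences are hypothetical and are not taken from the article.

```python
import math
from collections import Counter, defaultdict


def ngram_information(sequences, structures, n):
    """Return {ngram: (occurrences, information_in_bits)}.

    ASSUMPTION (not from the catalog record): information is the
    Kullback-Leibler divergence between the secondary-structure label
    distribution inside an N-gram and the background distribution
    collected over every N-gram window in the data set.
    """
    background = Counter()            # label counts over all windows
    per_ngram = defaultdict(Counter)  # label counts per N-gram

    for seq, ss in zip(sequences, structures):
        if len(seq) != len(ss):
            raise ValueError("sequence and structure lengths differ")
        # Slide a window of length n along the chain.
        for i in range(len(seq) - n + 1):
            gram = seq[i:i + n]
            labels = ss[i:i + n]
            per_ngram[gram].update(labels)
            background.update(labels)

    total_background = sum(background.values())
    result = {}
    for gram, counts in per_ngram.items():
        total = sum(counts.values())
        info = 0.0
        for label, count in counts.items():
            p = count / total                          # within this N-gram
            q = background[label] / total_background   # background
            info += p * math.log2(p / q)
        # Each occurrence contributes n labels, so total // n is the count.
        result[gram] = (total // n, info)
    return result


# Toy data (hypothetical): two identical chains, helix then strand.
toy_sequences = ["AAAAGGGG", "AAAAGGGG"]
toy_structures = ["HHHHEEEE", "HHHHEEEE"]
toy_info = ngram_information(toy_sequences, toy_structures, 4)
for gram, (count, bits) in sorted(toy_info.items()):
    print(gram, count, round(bits, 3))
```

In this toy set, an N-gram seen only in one structure scores higher than one whose labels match the background mix. With so few samples the score is unreliable, which is in line with the description's remark that most observed subsequences have low information quantity because of the lack of samples.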