Information quantity for secondary structure propensities of protein subsequences in the Protein Data Bank

Elucidating the principles of sequence–structure relationships of proteins is a long-standing issue in biology. The nature of a short segment of a protein is determined by both the subsequence of the segment itself and its environment. For example, a type of subsequence, the so-called chameleon sequ...

Bibliographic Details
Main authors: Kondo, Ryohei; Kasahara, Kota; Takahashi, Takuya
Format: Online Article Text
Language: English
Published: The Biophysical Society of Japan 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8926306/
https://www.ncbi.nlm.nih.gov/pubmed/35532457
http://dx.doi.org/10.2142/biophysico.bppb-v19.0002
_version_ 1784670213473566720
author Kondo, Ryohei
Kasahara, Kota
Takahashi, Takuya
author_facet Kondo, Ryohei
Kasahara, Kota
Takahashi, Takuya
author_sort Kondo, Ryohei
collection PubMed
description Elucidating the principles of sequence–structure relationships in proteins is a long-standing issue in biology. The nature of a short segment of a protein is determined by both the subsequence of the segment itself and its environment. For example, one type of subsequence, the so-called chameleon sequences, can form different secondary structures depending on their environment. Chameleon sequences are considered to have a weak tendency to form a specific structure. Although many chameleon sequences have been identified, they are only a small part of all possible subsequences in the proteome. The strength of the tendency to take a specific structure has not been fully quantified for each subsequence. In this study, we comprehensively analyzed subsequences consisting of four to nine amino acid residues, or N-grams (4≤N≤9), observed in non-redundant sequences in the Protein Data Bank (PDB). Tendencies to form a specific structure, in terms of secondary structure and accessible surface area, are quantified as information quantities for each N-gram. Although the majority of observed subsequences have low information quantity due to the lack of samples in the current PDB, thousands of N-grams with strong tendencies, including known structural motifs, were found. In addition, machine learning partially predicted the tendency of unknown N-grams, and thus this technique helps to extract knowledge from the limited number of samples in the PDB.
format Online
Article
Text
id pubmed-8926306
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher The Biophysical Society of Japan
record_format MEDLINE/PubMed
spelling pubmed-89263062022-05-04 Information quantity for secondary structure propensities of protein subsequences in the Protein Data Bank Kondo, Ryohei Kasahara, Kota Takahashi, Takuya Biophys Physicobiol Regular Article Elucidating the principles of sequence–structure relationships of proteins is a long-standing issue in biology. The nature of a short segment of a protein is determined by both the subsequence of the segment itself and its environment. For example, a type of subsequence, the so-called chameleon sequences, can form different secondary structures depending on its environments. Chameleon sequences are considered to have a weak tendency to form a specific structure. Although many chameleon sequences have been identified, they are only a small part of all possible subsequences in the proteome. The strength of the tendency to take a specific structure for each subsequence has not been fully quantified. In this study, we comprehensively analyzed subsequences consisting of four to nine amino acid residues, or N-gram (4≤N≤9), observed in non-redundant sequences in the Protein Data Bank (PDB). Tendencies to form a specific structure in terms of the secondary structure and accessible surface area are quantified as information quantities for each N-gram. Although the majority of observed subsequences have low information quantity due to lack of samples in the current PDB, thousands of N-grams with strong tendencies, including known structural motifs, were found. In addition, machine learning partially predicted the tendency of unknown N-grams, and thus, this technique helps to extract knowledge from the limited number of samples in the PDB. 
The Biophysical Society of Japan 2022-02-08 /pmc/articles/PMC8926306/ /pubmed/35532457 http://dx.doi.org/10.2142/biophysico.bppb-v19.0002 Text en 2022 THE BIOPHYSICAL SOCIETY OF JAPAN https://creativecommons.org/licenses/by-nc-sa/4.0/ This article is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, visit 
https://creativecommons.org/licenses/by-nc-sa/4.0/.
spellingShingle Regular Article
Kondo, Ryohei
Kasahara, Kota
Takahashi, Takuya
Information quantity for secondary structure propensities of protein subsequences in the Protein Data Bank
title Information quantity for secondary structure propensities of protein subsequences in the Protein Data Bank
title_full Information quantity for secondary structure propensities of protein subsequences in the Protein Data Bank
title_fullStr Information quantity for secondary structure propensities of protein subsequences in the Protein Data Bank
title_full_unstemmed Information quantity for secondary structure propensities of protein subsequences in the Protein Data Bank
title_short Information quantity for secondary structure propensities of protein subsequences in the Protein Data Bank
title_sort information quantity for secondary structure propensities of protein subsequences in the protein data bank
topic Regular Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8926306/
https://www.ncbi.nlm.nih.gov/pubmed/35532457
http://dx.doi.org/10.2142/biophysico.bppb-v19.0002
work_keys_str_mv AT kondoryohei informationquantityforsecondarystructurepropensitiesofproteinsubsequencesintheproteindatabank
AT kasaharakota informationquantityforsecondarystructurepropensitiesofproteinsubsequencesintheproteindatabank
AT takahashitakuya informationquantityforsecondarystructurepropensitiesofproteinsubsequencesintheproteindatabank
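
The abstract above describes quantifying, for each N-gram, how strongly it tends toward a specific secondary structure. This record does not give the exact formula, so the following is only an illustrative sketch under stated assumptions: the information quantity is taken to be the Kullback-Leibler divergence (in bits) between the secondary-structure distribution observed for an N-gram and the background distribution over all windows, and each window is labeled by the state of its central residue. The function name `ngram_information` and the input layout are hypothetical and are not taken from the article.

```python
import math
from collections import Counter, defaultdict


def ngram_information(records, n):
    """Return {ngram: (sample_count, information_in_bits)}.

    records: iterable of (sequence, secondary_structure) string pairs,
             aligned residue by residue (hypothetical input layout).
    n:       window length; the abstract considers 4 <= n <= 9.

    Assumption: information = KL divergence between the N-gram's
    state distribution and the background state distribution.
    """
    background = Counter()            # state counts over all windows
    per_ngram = defaultdict(Counter)  # state counts for each N-gram

    for seq, ss in records:
        if len(seq) != len(ss):
            raise ValueError("sequence and structure strings must align")
        for i in range(len(seq) - n + 1):
            gram = seq[i:i + n]
            # Assumption: label the window by its central residue's state.
            state = ss[i + n // 2]
            per_ngram[gram][state] += 1
            background[state] += 1

    total = sum(background.values())
    result = {}
    for gram, counts in per_ngram.items():
        m = sum(counts.values())
        info = sum(
            (c / m) * math.log2((c / m) / (background[s] / total))
            for s, c in counts.items()
        )
        result[gram] = (m, info)
    return result
```

Under these assumptions, an N-gram that always appears in the same state scores high, while a chameleon-like N-gram whose states mirror the background scores near zero. The sample count is returned alongside the score because, as the abstract notes, many N-grams are observed too rarely for the estimate to be meaningful.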