
Word Decoding of Protein Amino Acid Sequences with Availability Analysis: A Linguistic Approach

The amino acid sequences of proteins determine their three-dimensional structures and functions. However, how sequence information is related to structures and functions is still enigmatic. In this study, we show that at least a part of the sequence information can be extracted by treating amino aci...


Bibliographic Details
Main Authors:	Motomura, Kenta; Fujita, Tomohiro; Tsutsumi, Motosuke; Kikuzato, Satsuki; Nakamura, Morikazu; Otaki, Joji M.
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2012
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3503725/ https://www.ncbi.nlm.nih.gov/pubmed/23185527 http://dx.doi.org/10.1371/journal.pone.0050039

_version_	1782250494038638592
author	Motomura, Kenta; Fujita, Tomohiro; Tsutsumi, Motosuke; Kikuzato, Satsuki; Nakamura, Morikazu; Otaki, Joji M.
collection	PubMed
description	The amino acid sequences of proteins determine their three-dimensional structures and functions. However, how sequence information is related to structures and functions is still enigmatic. In this study, we show that at least a part of the sequence information can be extracted by treating amino acid sequences of proteins as a collection of English words, based on a working hypothesis that amino acid sequences of proteins are composed of short constituent amino acid sequences (SCSs) or “words”. We first confirmed that the English language highly likely follows Zipf's law, a special case of power law. We found that the rank-frequency plot of SCSs in proteins exhibits a similar distribution when low-rank tails are excluded. In comparison with natural English and “compressed” English without spaces between words, amino acid sequences of proteins show larger linear ranges and smaller exponents with heavier low-rank tails, demonstrating that the SCS distribution in proteins is largely scale-free. A distribution pattern of SCSs in proteins is similar among species, but species-specific features are also present. Based on the availability scores of SCSs, we found that sequence motifs are enriched in high-availability sites (i.e., “key words”) and vice versa. In fact, the highest availability peak within a given protein sequence often directly corresponds to a sequence motif. The amino acid composition of high-availability sites within motifs is different from that of entire motifs and all protein sequences, suggesting the possible functional importance of specific SCSs and their compositional amino acids within motifs. We anticipate that our availability-based word decoding approach is complementary to sequence alignment approaches in predicting functionally important sites of unknown proteins from their amino acid sequences.
format	Online Article Text
id	pubmed-3503725
institution	National Center for Biotechnology Information
language	English
publishDate	2012
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-3503725 2012-11-26 Word Decoding of Protein Amino Acid Sequences with Availability Analysis: A Linguistic Approach Motomura, Kenta; Fujita, Tomohiro; Tsutsumi, Motosuke; Kikuzato, Satsuki; Nakamura, Morikazu; Otaki, Joji M. PLoS One Research Article Public Library of Science 2012-11-21 /pmc/articles/PMC3503725/ /pubmed/23185527 http://dx.doi.org/10.1371/journal.pone.0050039 Text en © 2012 Motomura et al http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
title	Word Decoding of Protein Amino Acid Sequences with Availability Analysis: A Linguistic Approach
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3503725/ https://www.ncbi.nlm.nih.gov/pubmed/23185527 http://dx.doi.org/10.1371/journal.pone.0050039
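
The rank-frequency analysis mentioned in the description can be illustrated with a minimal Python sketch. It is not the authors' method: it assumes that "words" are approximated by overlapping fixed-length windows of k residues (the abstract does not say how SCSs are delimited), and it estimates a Zipf-like exponent with an ordinary least-squares fit on log-log axes. The function names and the toy sequences are hypothetical, and the availability score is not covered because the abstract does not define it.

```python
import math
from collections import Counter


def kmer_counts(sequences, k=3):
    """Count overlapping windows of k residues across all sequences.

    Assumption: fixed-length windows stand in for the paper's SCSs.
    """
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def rank_frequency(counts):
    """Return (rank, frequency) pairs; rank 1 is the most frequent window."""
    freqs = sorted(counts.values(), reverse=True)
    return list(enumerate(freqs, start=1))


def zipf_exponent(pairs):
    """Estimate s in frequency ~ rank**(-s) by least squares on log-log axes."""
    if len(pairs) < 2:
        raise ValueError("need at least two ranks to fit a slope")
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(freq) for _, freq in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx  # negate the slope so a decaying curve gives s > 0


if __name__ == "__main__":
    # Toy sequences made up for illustration; not real proteins.
    toy = ["MKTAYIAKQRMKTAYLAKQR", "GSHMKTAYIAKQ"]
    pairs = rank_frequency(kmer_counts(toy, k=3))
    print(pairs[:5])
    print(zipf_exponent(pairs))
```

On a real proteome, the fit would normally be restricted to the roughly linear part of the plot, since the description notes that the tails deviate from the power law.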
