Grammar of protein domain architectures

Bibliographic Details
Main Authors: Yu, Lijia; Tanwar, Deepak Kumar; Penha, Emanuel Diego S.; Wolf, Yuri I.; Koonin, Eugene V.; Basu, Malay Kumar
Format: Online Article Text
Language: English
Published: National Academy of Sciences 2019
Subjects: PNAS Plus
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6397568/
https://www.ncbi.nlm.nih.gov/pubmed/30733291
http://dx.doi.org/10.1073/pnas.1814684116
_version_ 1783399435932270592
author Yu, Lijia
Tanwar, Deepak Kumar
Penha, Emanuel Diego S.
Wolf, Yuri I.
Koonin, Eugene V.
Basu, Malay Kumar
author_facet Yu, Lijia
Tanwar, Deepak Kumar
Penha, Emanuel Diego S.
Wolf, Yuri I.
Koonin, Eugene V.
Basu, Malay Kumar
author_sort Yu, Lijia
collection PubMed
description From an abstract, informational perspective, protein domains appear analogous to words in natural languages in which the rules of word association are dictated by linguistic rules, or grammar. Such rules exist for protein domains as well, because only a small fraction of all possible domain combinations is viable in evolution. We employ a popular linguistic technique, n-gram analysis, to probe the “proteome grammar”—that is, the rules of association of domains that generate various domain architectures of proteins. Comparison of the complexity measures of “protein languages” in major branches of life shows that the relative entropy difference (information gain) between the observed domain architectures and random domain combinations is highly conserved in evolution and is close to being a universal constant, at ∼1.2 bits. Substantial deviations from this constant are observed in only two major groups of organisms: a subset of Archaea that appears to be cells simplified to the limit, and animals that display extreme complexity. We also identify the n-grams that represent signatures of the major branches of cellular life. The results of this analysis bolster the analogy between genomes and natural language and show that a “quasi-universal grammar” underlies the evolution of domain architectures in all divisions of cellular life. The nearly universal value of information gain by the domain architectures could reflect the minimum complexity of signal processing that is required to maintain a functioning cell.
format Online
Article
Text
id pubmed-6397568
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher National Academy of Sciences
record_format MEDLINE/PubMed
spelling pubmed-6397568 2019-03-06 Grammar of protein domain architectures Yu, Lijia Tanwar, Deepak Kumar Penha, Emanuel Diego S. Wolf, Yuri I. Koonin, Eugene V. Basu, Malay Kumar Proc Natl Acad Sci U S A PNAS Plus National Academy of Sciences 2019-02-26 2019-02-07 /pmc/articles/PMC6397568/ /pubmed/30733291 http://dx.doi.org/10.1073/pnas.1814684116 Text en Copyright © 2019 the Author(s). Published by PNAS.
https://creativecommons.org/licenses/by-nc-nd/4.0/ This open access article is distributed under Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND) (https://creativecommons.org/licenses/by-nc-nd/4.0/).
spellingShingle PNAS Plus
Yu, Lijia
Tanwar, Deepak Kumar
Penha, Emanuel Diego S.
Wolf, Yuri I.
Koonin, Eugene V.
Basu, Malay Kumar
Grammar of protein domain architectures
title Grammar of protein domain architectures
title_full Grammar of protein domain architectures
title_fullStr Grammar of protein domain architectures
title_full_unstemmed Grammar of protein domain architectures
title_short Grammar of protein domain architectures
title_sort grammar of protein domain architectures
topic PNAS Plus
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6397568/
https://www.ncbi.nlm.nih.gov/pubmed/30733291
http://dx.doi.org/10.1073/pnas.1814684116
work_keys_str_mv AT yulijia grammarofproteindomainarchitectures
AT tanwardeepakkumar grammarofproteindomainarchitectures
AT penhaemanueldiegos grammarofproteindomainarchitectures
AT wolfyurii grammarofproteindomainarchitectures
AT koonineugenev grammarofproteindomainarchitectures
AT basumalaykumar grammarofproteindomainarchitectures
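
The description field above refers to n-gram analysis of domain architectures and to an information gain measured in bits. As a minimal sketch only, the following Python shows one way such a quantity could be computed for bigrams: the Kullback-Leibler divergence between observed bigram frequencies and the frequencies expected if adjacent domains were paired independently. This record does not state the exact definition used in the article, so the formula, the function names `ngrams` and `bigram_information_gain`, and the toy domain labels are all assumptions made for illustration.

```python
from collections import Counter
from math import log2


def ngrams(architecture, n):
    """Return the contiguous n-grams of one domain architecture.

    An architecture is a sequence of domain names in N-to-C order.
    """
    return [tuple(architecture[i:i + n]) for i in range(len(architecture) - n + 1)]


def bigram_information_gain(architectures):
    """KL divergence (in bits) between observed and 'random' bigrams.

    Assumption: the random model pairs the first and second positions
    independently, using the marginal frequencies of the observed bigrams.
    This is an illustrative definition, not necessarily the article's.
    """
    pairs = Counter()
    for arch in architectures:
        pairs.update(ngrams(arch, 2))

    total = sum(pairs.values())
    if total == 0:
        return 0.0  # no bigrams at all (only single-domain proteins)

    # Marginal counts of the first and second domain of each bigram.
    first = Counter()
    second = Counter()
    for (a, b), count in pairs.items():
        first[a] += count
        second[b] += count

    gain = 0.0
    for (a, b), count in pairs.items():
        p_obs = count / total
        p_rand = (first[a] / total) * (second[b] / total)
        gain += p_obs * log2(p_obs / p_rand)
    return gain


# Hypothetical toy proteomes; the domain names are placeholders, not real data.
strict = [["A", "B"], ["A", "B"], ["C", "D"], ["C", "D"]]
free = [["A", "B"], ["A", "D"], ["C", "B"], ["C", "D"]]
gain_strict = bigram_information_gain(strict)
gain_free = bigram_information_gain(free)
```

In the `strict` set each first domain fully determines its partner, so the divergence from the independent model is positive. In the `free` set every combination occurs equally often, so the observed and random distributions coincide and the gain is zero.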