Unique function words characterize genomic proteins
Between 2009 and 2016 the number of protein sequences from known species increased 10-fold from 8 million to 85 million. About 80% of these sequences contain at least one region recognized by the conserved domain architecture retrieval tool (CDART) as a sequence motif. Motifs provide clues to biolog...
Main authors: | Scaiewicz, Andrea, Levitt, Michael |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | National Academy of Sciences 2018 |
Subjects: | Biological Sciences |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6042118/ https://www.ncbi.nlm.nih.gov/pubmed/29895692 http://dx.doi.org/10.1073/pnas.1801182115 |
_version_ | 1783339093739962368 |
---|---|
author | Scaiewicz, Andrea Levitt, Michael |
author_facet | Scaiewicz, Andrea Levitt, Michael |
author_sort | Scaiewicz, Andrea |
collection | PubMed |
description | Between 2009 and 2016 the number of protein sequences from known species increased 10-fold from 8 million to 85 million. About 80% of these sequences contain at least one region recognized by the conserved domain architecture retrieval tool (CDART) as a sequence motif. Motifs provide clues to biological function but CDART often matches the same region of a protein by two or more profiles. Such synonyms complicate estimates of functional complexity. We do full-linkage clustering of redundant profiles by finding maximum disjoint cliques: Each cluster is replaced by a single representative profile to give what we term a unique function word (UFW). From 2009 to 2016, the number of sequence profiles used by CDART increased by 80%; the number of UFWs increased more slowly by 30%, indicating that the number of UFWs may be saturating. The number of sequences matched by a single UFW (sequences with single domain architectures) increased as slowly as the number of different words, whereas the number of sequences matched by a combination of two or more UFWs in sequences with multiple domain architectures (MDAs) increased at the same rate as the total number of sequences. This combinatorial arrangement of a limited number of UFWs in MDAs accounts for the genomic diversity of protein sequences. Although eukaryotes and prokaryotes use very similar sets of “words” or UFWs (57% shared), the “sentences” (MDAs) are different (1.3% shared). |
format | Online Article Text |
id | pubmed-6042118 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2018 |
publisher | National Academy of Sciences |
record_format | MEDLINE/PubMed |
spelling | pubmed-6042118 2018-07-13 Unique function words characterize genomic proteins Scaiewicz, Andrea Levitt, Michael Proc Natl Acad Sci U S A Biological Sciences National Academy of Sciences 2018-06-26 2018-06-12 /pmc/articles/PMC6042118/ /pubmed/29895692 http://dx.doi.org/10.1073/pnas.1801182115 Text en Copyright © 2018 the Author(s). Published by PNAS. This open access article is distributed under Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND) (https://creativecommons.org/licenses/by-nc-nd/4.0/). |
spellingShingle | Biological Sciences Scaiewicz, Andrea Levitt, Michael Unique function words characterize genomic proteins |
title | Unique function words characterize genomic proteins |
title_full | Unique function words characterize genomic proteins |
title_fullStr | Unique function words characterize genomic proteins |
title_full_unstemmed | Unique function words characterize genomic proteins |
title_short | Unique function words characterize genomic proteins |
title_sort | unique function words characterize genomic proteins |
topic | Biological Sciences |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6042118/ https://www.ncbi.nlm.nih.gov/pubmed/29895692 http://dx.doi.org/10.1073/pnas.1801182115 |
work_keys_str_mv | AT scaiewiczandrea uniquefunctionwordscharacterizegenomicproteins AT levittmichael uniquefunctionwordscharacterizegenomicproteins |
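
The description field says that redundant profiles are grouped by full-linkage clustering, found as maximum disjoint cliques, and that each cluster is replaced by one representative profile (a UFW). The record does not give the procedure in detail, so the following is only a minimal sketch of one way such a grouping could work. It assumes that redundancy between two profiles is already known and supplied as a list of pairs. The function names `max_clique` and `disjoint_cliques`, the greedy "take the largest clique, remove it, repeat" strategy, and the toy profile labels are all illustrative assumptions, not the authors' implementation.

```python
def max_clique(adj, nodes):
    """Return one largest clique among `nodes` (Bron-Kerbosch with pivoting)."""
    best = set()

    def expand(r, p, x):
        nonlocal best
        if not p and not x:
            # r is a maximal clique; keep it if it is the largest seen so far
            if len(r) > len(best):
                best = set(r)
            return
        if len(r) + len(p) <= len(best):
            return  # this branch cannot beat the best clique found so far
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(nodes), set())
    return best


def disjoint_cliques(redundant_pairs, profiles):
    """Greedily split profiles into disjoint cliques (full-linkage clusters).

    Assumption: `redundant_pairs` lists the profile pairs judged redundant;
    how that judgement is made is outside this sketch.
    """
    adj = {p: set() for p in profiles}
    for a, b in redundant_pairs:
        adj[a].add(b)
        adj[b].add(a)

    remaining = set(profiles)
    clusters = []
    while remaining:
        # Restrict the graph to profiles not yet assigned to a cluster
        sub = {p: adj[p] & remaining for p in remaining}
        clique = max_clique(sub, remaining)
        clusters.append(sorted(clique))
        remaining -= clique
    return clusters


# Toy input: A, B, C are mutually redundant; D is redundant only with C.
pairs = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
clusters = disjoint_cliques(pairs, ["A", "B", "C", "D", "E"])
# Hypothetical choice of representative: the first label in each cluster
representatives = sorted(c[0] for c in clusters)
print(sorted(clusters), representatives)
```

In the toy input, D is linked only to C, so under full linkage it cannot join the A, B, C cluster and ends up as its own word. This is the property that distinguishes clique-based clustering from single-linkage grouping, where one shared neighbor would be enough to merge.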