Low Complexity Regions in Proteins and DNA are Poorly Correlated

Low complexity sequences (LCRs) are well known within coding as well as non-coding sequences. A low complexity region within a protein must be encoded by the underlying DNA sequence. Here, we examine the relationship between the entropy of the protein sequence and that of the DNA sequence which enco...

Bibliographic Details
Main Authors: Enright, Johanna M, Dickson, Zachery W, Golding, G Brian
Format: Online Article Text
Language: English
Published: Oxford University Press 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10124876/
https://www.ncbi.nlm.nih.gov/pubmed/37036379
http://dx.doi.org/10.1093/molbev/msad084
_version_ 1785029927023673344
author Enright, Johanna M
Dickson, Zachery W
Golding, G Brian
author_facet Enright, Johanna M
Dickson, Zachery W
Golding, G Brian
author_sort Enright, Johanna M
collection PubMed
description Low complexity sequences (LCRs) are well known within coding as well as non-coding sequences. A low complexity region within a protein must be encoded by the underlying DNA sequence. Here, we examine the relationship between the entropy of the protein sequence and that of the DNA sequence which encodes it. We show that they are poorly correlated whether starting with a low complexity region within the protein and comparing it to the corresponding sequence in the DNA or by finding a low complexity region within coding DNA and comparing it to the corresponding sequence in the protein. We show this is the case within the proteomes of five model organisms: Homo sapiens, Saccharomyces cerevisiae, Drosophila melanogaster, Caenorhabditis elegans, and Arabidopsis thaliana. We also report a significant bias against mononucleic codons in LCR encoding sequences. By comparison with simulated proteomes, we show that highly repetitive LCRs may be explained by neutral, slippage-based evolution, but compositionally biased LCRs with cryptic repeats are not. We demonstrate that other biological biases and forces must be acting to create and maintain these LCRs. Uncovering these forces will improve our understanding of protein LCR evolution.
format Online
Article
Text
id pubmed-10124876
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-10124876 2023-04-25 Low Complexity Regions in Proteins and DNA are Poorly Correlated Enright, Johanna M Dickson, Zachery W Golding, G Brian Mol Biol Evol Discoveries Low complexity sequences (LCRs) are well known within coding as well as non-coding sequences. A low complexity region within a protein must be encoded by the underlying DNA sequence. Here, we examine the relationship between the entropy of the protein sequence and that of the DNA sequence which encodes it. We show that they are poorly correlated whether starting with a low complexity region within the protein and comparing it to the corresponding sequence in the DNA or by finding a low complexity region within coding DNA and comparing it to the corresponding sequence in the protein. We show this is the case within the proteomes of five model organisms: Homo sapiens, Saccharomyces cerevisiae, Drosophila melanogaster, Caenorhabditis elegans, and Arabidopsis thaliana. We also report a significant bias against mononucleic codons in LCR encoding sequences. By comparison with simulated proteomes, we show that highly repetitive LCRs may be explained by neutral, slippage-based evolution, but compositionally biased LCRs with cryptic repeats are not. We demonstrate that other biological biases and forces must be acting to create and maintain these LCRs. Uncovering these forces will improve our understanding of protein LCR evolution. Oxford University Press 2023-04-10 /pmc/articles/PMC10124876/ /pubmed/37036379 http://dx.doi.org/10.1093/molbev/msad084 Text en © The Author(s) 2023. Published by Oxford University Press on behalf of Society for Molecular Biology and Evolution.
https://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
spellingShingle Discoveries
Enright, Johanna M
Dickson, Zachery W
Golding, G Brian
Low Complexity Regions in Proteins and DNA are Poorly Correlated
title Low Complexity Regions in Proteins and DNA are Poorly Correlated
title_full Low Complexity Regions in Proteins and DNA are Poorly Correlated
title_fullStr Low Complexity Regions in Proteins and DNA are Poorly Correlated
title_full_unstemmed Low Complexity Regions in Proteins and DNA are Poorly Correlated
title_short Low Complexity Regions in Proteins and DNA are Poorly Correlated
title_sort low complexity regions in proteins and dna are poorly correlated
topic Discoveries
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10124876/
https://www.ncbi.nlm.nih.gov/pubmed/37036379
http://dx.doi.org/10.1093/molbev/msad084
work_keys_str_mv AT enrightjohannam lowcomplexityregionsinproteinsanddnaarepoorlycorrelated
AT dicksonzacheryw lowcomplexityregionsinproteinsanddnaarepoorlycorrelated
AT goldinggbrian lowcomplexityregionsinproteinsanddnaarepoorlycorrelated