
Self-analysis of repeat proteins reveals evolutionarily conserved patterns

BACKGROUND: Protein repeats can confound sequence analyses because the repetitiveness of their amino acid sequences leads to difficulties in identifying whether similar repeats are due to convergent or divergent evolution. We noted that the patterns derived from traditional “dot plot” protein sequenc...


Bibliographic Details
Main authors:	Merski, Matthew; Młynarczyk, Krzysztof; Ludwiczak, Jan; Skrzeczkowski, Jakub; Dunin-Horkawicz, Stanisław; Górna, Maria W.
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2020
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7204011/ https://www.ncbi.nlm.nih.gov/pubmed/32381046 http://dx.doi.org/10.1186/s12859-020-3493-y

_version_	1783529976642928640
author	Merski, Matthew; Młynarczyk, Krzysztof; Ludwiczak, Jan; Skrzeczkowski, Jakub; Dunin-Horkawicz, Stanisław; Górna, Maria W.
author_sort	Merski, Matthew
collection	PubMed
description	BACKGROUND: Protein repeats can confound sequence analyses because the repetitiveness of their amino acid sequences leads to difficulties in identifying whether similar repeats are due to convergent or divergent evolution. We noted that the patterns derived from traditional “dot plot” protein sequence self-similarity analysis tended to be conserved in sets of related repeat proteins and that this conservation could be quantified using a Jaccard metric. RESULTS: Comparison of these dot plots obviated the issues due to sequence similarity for analysis of repeat proteins. A high Jaccard similarity score was suggestive of a conserved relationship between closely related repeat proteins. The dot plot patterns decayed quickly in the absence of selective pressure, with an expected loss of 50% of Jaccard similarity due to a loss of 8.2% sequence identity. To perform method testing, we assembled a standard set of 79 repeat proteins representing all the subgroups in RepeatsDB. Comparison of known repeat and non-repeat proteins from the PDB suggested that the information content in dot plots could be used to identify repeat proteins from pure sequence with no requirement for structural information. Analysis of the UniRef90 database suggested that 16.9% of all known proteins could be classified as repeat proteins. These 13.3 million putative repeat protein chains were clustered, and a large majority (82.9%) of clusters containing between 5 and 200 members were of a single functional type. CONCLUSIONS: Dot plot analysis of repeat proteins attempts to obviate issues that arise due to the sequence degeneracy of repeat proteins. These results show that this kind of analysis can efficiently be applied to analyze repeat proteins on a large scale.
format	Online Article Text
id	pubmed-7204011
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-7204011 2020-05-12 Self-analysis of repeat proteins reveals evolutionarily conserved patterns. Merski, Matthew; Młynarczyk, Krzysztof; Ludwiczak, Jan; Skrzeczkowski, Jakub; Dunin-Horkawicz, Stanisław; Górna, Maria W. BMC Bioinformatics. Research Article. BioMed Central 2020-05-07 /pmc/articles/PMC7204011/ /pubmed/32381046 http://dx.doi.org/10.1186/s12859-020-3493-y Text en © The Author(s). 2020 Open Access. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title	Self-analysis of repeat proteins reveals evolutionarily conserved patterns
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7204011/ https://www.ncbi.nlm.nih.gov/pubmed/32381046 http://dx.doi.org/10.1186/s12859-020-3493-y
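
Illustrative note: the abstract describes building a self-similarity "dot plot" for each protein sequence and comparing the plots of two proteins with a Jaccard metric. The Python sketch below shows one simple way such a comparison could work. It is not the authors' implementation. The function names `self_dot_plot` and `jaccard`, the exact-match rule for filling a cell, the `window` parameter, and the choice to compare cells by absolute position are all assumptions made for illustration; the abstract does not state the scoring scheme, window size, or how plots of different sizes are compared.

```python
def self_dot_plot(seq, window=1):
    """Return the set of (i, j) cells with i < j whose windows match exactly.

    Assumption: a cell is filled when the two length-`window` substrings
    starting at positions i and j are identical. Only the upper triangle
    is kept, because a self dot plot is symmetric and its main diagonal
    is always filled.
    """
    n = len(seq) - window + 1
    return {
        (i, j)
        for i in range(n)
        for j in range(i + 1, n)
        if seq[i:i + window] == seq[j:j + window]
    }


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0  # convention chosen here for two empty plots
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    # Toy sequences, not real proteins.
    plot_1 = self_dot_plot("ABAB")
    plot_2 = self_dot_plot("ABAC")
    print(sorted(plot_1))           # filled cells of the first plot
    print(jaccard(plot_1, plot_2))  # similarity of the two plots
```

Representing each plot as a set of filled cells keeps the Jaccard step a direct set intersection over set union. For real sequences a substitution-matrix score with a threshold would likely replace the exact-match test, and that choice would change the resulting similarity values.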
