Bridging the gaps in statistical models of protein alignment

SUMMARY: Sequences of proteins evolve by accumulating substitutions together with insertions and deletions (indels) of amino acids. However, it remains a common practice to disconnect substitutions and indels, and infer approximate models for each of them separately, to quantify sequence relationshi...

Bibliographic Details
Main Authors: Sumanaweera, Dinithi, Allison, Lloyd, Konagurthu, Arun S
Format: Online Article Text
Language: English
Published: Oxford University Press 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9235498/
https://www.ncbi.nlm.nih.gov/pubmed/35758809
http://dx.doi.org/10.1093/bioinformatics/btac246
_version_ 1784736324755914752
author Sumanaweera, Dinithi
Allison, Lloyd
Konagurthu, Arun S
collection PubMed
description SUMMARY: Sequences of proteins evolve by accumulating substitutions together with insertions and deletions (indels) of amino acids. However, it remains a common practice to disconnect substitutions and indels, and infer approximate models for each of them separately, to quantify sequence relationships. Although this approach brings with it computational convenience (which remains its primary motivation), there is a dearth of attempts to unify and model them systematically and together. To overcome this gap, this article demonstrates how a complete statistical model quantifying the evolution of pairs of aligned proteins can be constructed using a time-parameterized substitution matrix and a time-parameterized alignment state machine. Methods to derive all parameters of such a model from any benchmark collection of aligned protein sequences are described here. This has not only allowed us to generate a unified statistical model for each of the nine widely used substitution matrices (PAM, JTT, BLOSUM, JO, WAG, VTML, LG, MIQS and PFASUM), but also resulted in a new unified model, MMLSUM. Our underlying methodology measures the Shannon information content using each model to explain losslessly any given collection of alignments, which has allowed us to quantify the performance of all the above models on six comprehensive alignment benchmarks. Our results show that MMLSUM results in a new and clear overall best performance, followed by PFASUM, VTML, BLOSUM and MIQS, respectively, amongst the top five. We further analyze the statistical properties of MMLSUM model and contrast it with others. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format Online
Article
Text
id pubmed-9235498
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-9235498 2022-06-29 Bridging the gaps in statistical models of protein alignment Sumanaweera, Dinithi Allison, Lloyd Konagurthu, Arun S Bioinformatics ISCB/Ismb 2022
Oxford University Press 2022-06-27 /pmc/articles/PMC9235498/ /pubmed/35758809 http://dx.doi.org/10.1093/bioinformatics/btac246 Text en © The Author(s) 2022. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Bridging the gaps in statistical models of protein alignment
topic ISCB/Ismb 2022
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9235498/
https://www.ncbi.nlm.nih.gov/pubmed/35758809
http://dx.doi.org/10.1093/bioinformatics/btac246