PPalign: optimal alignment of Potts models representing proteins with direct coupling information

Bibliographic Details
Main authors: Talibart, Hugo; Coste, François
Format: Online Article Text
Language: English
Published: BioMed Central 2021
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8191105/
https://www.ncbi.nlm.nih.gov/pubmed/34112081
http://dx.doi.org/10.1186/s12859-021-04222-4
author Talibart, Hugo
Coste, François
collection PubMed
description BACKGROUND: To assign structural and functional annotations to the ever-increasing number of sequenced proteins, the main approach relies on sequence-based homology search methods, e.g. BLAST or the current state-of-the-art methods based on profile Hidden Markov Models, which rely on significant alignments of query sequences to annotated proteins or protein families. While powerful, these approaches do not take coevolution between residues into account. Taking advantage of recent advances in the field of contact prediction, we propose here to represent proteins by Potts models, which model direct couplings between positions in addition to positional composition, and to compare proteins by aligning these models. Due to non-local dependencies, the problem of aligning Potts models is hard and remains the main computational bottleneck for their use. METHODS: We introduce here an Integer Linear Programming formulation of the problem and PPalign, a program based on this formulation, to compute the optimal pairwise alignment of Potts models representing proteins in tractable time. The approach is assessed against a non-redundant set of reference pairwise sequence alignments from the SISYPHUS benchmark that have the lowest sequence identity (between [Formula: see text] and [Formula: see text]) and for which reliable Potts models can be built for each sequence to be aligned. This experiment confirms that Potts models can be aligned in reasonable time ([Formula: see text] on average on these alignments). The contribution of couplings is evaluated in comparison with HHalign and independent-site PPalign. Although Potts models were not fully optimized for alignment purposes and simple gap scores were used, PPalign yields a better mean [Formula: see text] score and finds significantly better alignments than HHalign and PPalign without couplings in some cases.
CONCLUSIONS: These results show that pairwise couplings from protein Potts models can be used to improve the alignment of remotely related protein sequences in tractable time. Our experiments nevertheless suggest that new research on the inference of Potts models is now needed to make them more comparable and suitable for homology search. We think that PPalign’s guaranteed optimality will be a powerful asset for performing unbiased investigations in this direction.
format Online
Article
Text
id pubmed-8191105
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-8191105 2021-06-10 PPalign: optimal alignment of Potts models representing proteins with direct coupling information Talibart, Hugo Coste, François BMC Bioinformatics Research BioMed Central 2021-06-10 /pmc/articles/PMC8191105/ /pubmed/34112081 http://dx.doi.org/10.1186/s12859-021-04222-4 Text en © The Author(s) 2021
Open Access: This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.
The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
title PPalign: optimal alignment of Potts models representing proteins with direct coupling information
topic Research