
Leveraging protein language models for accurate multiple sequence alignments

Multiple sequence alignment (MSA) is a critical step in the study of protein sequence and function. Typically, MSA algorithms progressively align pairs of sequences and combine these alignments with the aid of a guide tree. These alignment algorithms use scoring systems based on substitution matrice...


Bibliographic Details
Main Authors: McWhite, Claire D., Armour-Garb, Isabel, Singh, Mona
Format: Online Article Text
Language: English
Published: Cold Spring Harbor Laboratory Press 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10538487/
https://www.ncbi.nlm.nih.gov/pubmed/37414576
http://dx.doi.org/10.1101/gr.277675.123
author McWhite, Claire D.
Armour-Garb, Isabel
Singh, Mona
author_facet McWhite, Claire D.
Armour-Garb, Isabel
Singh, Mona
author_sort McWhite, Claire D.
collection PubMed
description Multiple sequence alignment (MSA) is a critical step in the study of protein sequence and function. Typically, MSA algorithms progressively align pairs of sequences and combine these alignments with the aid of a guide tree. These alignment algorithms use scoring systems based on substitution matrices to measure amino acid similarities. Although successful, standard methods struggle on sets of proteins with low sequence identity: the so-called twilight zone of protein alignment. For these difficult cases, another source of information is needed. Protein language models are a powerful new approach that leverages massive sequence data sets to produce high-dimensional contextual embeddings for each amino acid in a sequence. These embeddings have been shown to reflect physicochemical and higher-order structural and functional attributes of amino acids within proteins. Here, we present a novel approach to MSA, based on clustering and ordering amino acid contextual embeddings. Our method for aligning semantically consistent groups of proteins circumvents the need for many standard components of MSA algorithms, avoiding initial guide tree construction, intermediate pairwise alignments, gap penalties, and substitution matrices. The added information from contextual embeddings leads to higher accuracy alignments for structurally similar proteins with low amino-acid similarity. We anticipate that protein language models will become a fundamental component of the next generation of algorithms for generating MSAs.
format Online
Article
Text
id pubmed-10538487
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Cold Spring Harbor Laboratory Press
record_format MEDLINE/PubMed
spelling pubmed-105384872023-09-29 Leveraging protein language models for accurate multiple sequence alignments McWhite, Claire D. Armour-Garb, Isabel Singh, Mona Genome Res Methods Multiple sequence alignment (MSA) is a critical step in the study of protein sequence and function. Typically, MSA algorithms progressively align pairs of sequences and combine these alignments with the aid of a guide tree. These alignment algorithms use scoring systems based on substitution matrices to measure amino acid similarities. Although successful, standard methods struggle on sets of proteins with low sequence identity: the so-called twilight zone of protein alignment. For these difficult cases, another source of information is needed. Protein language models are a powerful new approach that leverages massive sequence data sets to produce high-dimensional contextual embeddings for each amino acid in a sequence. These embeddings have been shown to reflect physicochemical and higher-order structural and functional attributes of amino acids within proteins. Here, we present a novel approach to MSA, based on clustering and ordering amino acid contextual embeddings. Our method for aligning semantically consistent groups of proteins circumvents the need for many standard components of MSA algorithms, avoiding initial guide tree construction, intermediate pairwise alignments, gap penalties, and substitution matrices. The added information from contextual embeddings leads to higher accuracy alignments for structurally similar proteins with low amino-acid similarity. We anticipate that protein language models will become a fundamental component of the next generation of algorithms for generating MSAs. 
Cold Spring Harbor Laboratory Press 2023-07 /pmc/articles/PMC10538487/ /pubmed/37414576 http://dx.doi.org/10.1101/gr.277675.123 Text en © 2023 McWhite et al.; Published by Cold Spring Harbor Laboratory Press https://creativecommons.org/licenses/by-nc/4.0/ This article, published in Genome Research, is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.
spellingShingle Methods
McWhite, Claire D.
Armour-Garb, Isabel
Singh, Mona
Leveraging protein language models for accurate multiple sequence alignments
title Leveraging protein language models for accurate multiple sequence alignments
topic Methods
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10538487/
https://www.ncbi.nlm.nih.gov/pubmed/37414576
http://dx.doi.org/10.1101/gr.277675.123
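
The description in this record outlines an alignment built by clustering and ordering per-residue contextual embeddings, without guide trees, gap penalties, or substitution matrices. The sketch below is only a toy illustration of that general idea, not the authors' implementation. Everything in it is an assumption made for illustration: the two short sequences, the hand-written 2-D vectors standing in for protein language model embeddings, the greedy cosine-similarity clustering, the 0.9 threshold, and the mean-position ordering heuristic (which is not guaranteed to give a consistent column order on real data). The function names `cosine`, `cluster_residues`, `order_clusters`, and `to_alignment` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def cluster_residues(embeddings, threshold=0.9):
    """Greedy clustering (an illustrative stand-in, not the published method).

    Each residue joins the first cluster whose seed residue is at least
    `threshold` similar and that holds no residue from the same sequence;
    otherwise it starts a new cluster. Clusters are lists of (seq_idx, pos).
    """
    clusters = []
    for seq_idx, seq_vectors in enumerate(embeddings):
        for pos, vec in enumerate(seq_vectors):
            for cluster in clusters:
                seed_seq, seed_pos = cluster[0]
                already_has_seq = any(s == seq_idx for s, _ in cluster)
                seed_vec = embeddings[seed_seq][seed_pos]
                if not already_has_seq and cosine(vec, seed_vec) >= threshold:
                    cluster.append((seq_idx, pos))
                    break
            else:
                # No compatible cluster found: start a new one.
                clusters.append([(seq_idx, pos)])
    return clusters


def order_clusters(clusters):
    """Order clusters by the mean sequence position of their members (heuristic)."""
    return sorted(clusters, key=lambda c: sum(p for _, p in c) / len(c))


def to_alignment(sequences, ordered_clusters):
    """Turn ordered clusters into alignment rows; '-' marks an absent residue."""
    rows = []
    for seq_idx, seq in enumerate(sequences):
        row = []
        for cluster in ordered_clusters:
            pos = next((p for s, p in cluster if s == seq_idx), None)
            row.append("-" if pos is None else seq[pos])
        rows.append("".join(row))
    return rows


# Toy input: made-up sequences and made-up 2-D "embeddings" (assumptions).
sequences = ["MKV", "MV"]
embeddings = [
    [(1.0, 0.0), (0.7, 0.7), (0.0, 1.0)],  # M, K, V
    [(0.99, 0.1), (0.1, 0.99)],            # M, V
]

clusters = order_clusters(cluster_residues(embeddings))
alignment = to_alignment(sequences, clusters)
print("\n".join(alignment))
```

In this toy case each cluster becomes one alignment column, and a sequence with no residue in a cluster receives a gap in that column. Real contextual embeddings have hundreds or thousands of dimensions, so the clustering and ordering steps would need considerably more care than this sketch suggests.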