Leveraging protein language models for accurate multiple sequence alignments
Multiple sequence alignment (MSA) is a critical step in the study of protein sequence and function. Typically, MSA algorithms progressively align pairs of sequences and combine these alignments with the aid of a guide tree. These alignment algorithms use scoring systems based on substitution matrice...
Main authors: | McWhite, Claire D.; Armour-Garb, Isabel; Singh, Mona |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Cold Spring Harbor Laboratory Press 2023 |
Subjects: | Methods |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10538487/ https://www.ncbi.nlm.nih.gov/pubmed/37414576 http://dx.doi.org/10.1101/gr.277675.123 |
_version_ | 1785113317240471552 |
---|---|
author | McWhite, Claire D. Armour-Garb, Isabel Singh, Mona |
author_facet | McWhite, Claire D. Armour-Garb, Isabel Singh, Mona |
author_sort | McWhite, Claire D. |
collection | PubMed |
description | Multiple sequence alignment (MSA) is a critical step in the study of protein sequence and function. Typically, MSA algorithms progressively align pairs of sequences and combine these alignments with the aid of a guide tree. These alignment algorithms use scoring systems based on substitution matrices to measure amino acid similarities. Although successful, standard methods struggle on sets of proteins with low sequence identity: the so-called twilight zone of protein alignment. For these difficult cases, another source of information is needed. Protein language models are a powerful new approach that leverages massive sequence data sets to produce high-dimensional contextual embeddings for each amino acid in a sequence. These embeddings have been shown to reflect physicochemical and higher-order structural and functional attributes of amino acids within proteins. Here, we present a novel approach to MSA, based on clustering and ordering amino acid contextual embeddings. Our method for aligning semantically consistent groups of proteins circumvents the need for many standard components of MSA algorithms, avoiding initial guide tree construction, intermediate pairwise alignments, gap penalties, and substitution matrices. The added information from contextual embeddings leads to higher accuracy alignments for structurally similar proteins with low amino-acid similarity. We anticipate that protein language models will become a fundamental component of the next generation of algorithms for generating MSAs. |
format | Online Article Text |
id | pubmed-10538487 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Cold Spring Harbor Laboratory Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-10538487 2023-09-29 Leveraging protein language models for accurate multiple sequence alignments McWhite, Claire D. Armour-Garb, Isabel Singh, Mona Genome Res Methods Cold Spring Harbor Laboratory Press 2023-07 /pmc/articles/PMC10538487/ /pubmed/37414576 http://dx.doi.org/10.1101/gr.277675.123 Text en © 2023 McWhite et al.; Published by Cold Spring Harbor Laboratory Press. This article, published in Genome Research, is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/. |
spellingShingle | Methods McWhite, Claire D. Armour-Garb, Isabel Singh, Mona Leveraging protein language models for accurate multiple sequence alignments |
title | Leveraging protein language models for accurate multiple sequence alignments |
title_sort | leveraging protein language models for accurate multiple sequence alignments |
topic | Methods |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10538487/ https://www.ncbi.nlm.nih.gov/pubmed/37414576 http://dx.doi.org/10.1101/gr.277675.123 |
work_keys_str_mv | AT mcwhiteclaired leveragingproteinlanguagemodelsforaccuratemultiplesequencealignments AT armourgarbisabel leveragingproteinlanguagemodelsforaccuratemultiplesequencealignments AT singhmona leveragingproteinlanguagemodelsforaccuratemultiplesequencealignments |
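
The abstract in this record describes alignment built from amino acid contextual embeddings rather than from substitution matrices. As a rough illustration only, and not the authors' implementation, the sketch below pairs residues of two sequences when each is the other's most similar residue by cosine similarity. The function names (`cosine`, `reciprocal_best_hits`) and the three-dimensional toy vectors are hypothetical; real embeddings would come from a protein language model and have far more dimensions, and the clustering and ordering steps mentioned in the abstract are not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def reciprocal_best_hits(emb_a, emb_b):
    """Return (i, j) pairs where residue i of A and residue j of B
    are each other's most similar residue (hypothetical helper)."""
    # Best match in B for every residue of A.
    best_ab = [
        max(range(len(emb_b)), key=lambda j: cosine(u, emb_b[j]))
        for u in emb_a
    ]
    # Best match in A for every residue of B.
    best_ba = [
        max(range(len(emb_a)), key=lambda i: cosine(v, emb_a[i]))
        for v in emb_b
    ]
    # Keep only the matches that agree in both directions.
    return [(i, j) for i, j in enumerate(best_ab) if best_ba[j] == i]


# Toy per-residue "embeddings" (assumed values, for illustration only).
seq_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
seq_b = [[0.9, 0.1, 0.0], [0.1, 0.0, 0.9]]

pairs = reciprocal_best_hits(seq_a, seq_b)
print(pairs)  # [(0, 0), (2, 1)]
```

In this toy case residue 1 of the first sequence has no reciprocal partner, so in an alignment it would sit opposite a gap without any explicit gap penalty being applied.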