Nearest neighbor search on embeddings rapidly identifies distant protein relations

Since 1992, all state-of-the-art methods for fast and sensitive identification of evolutionary, structural, and functional relations between proteins (also referred to as “homology detection”) use sequences and sequence-profiles (PSSMs). Protein Language Models (pLMs) generalize sequences, possibly...
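
The idea of a nearest neighbor search over embeddings can be illustrated with a minimal sketch. This record does not describe the authors' implementation, so everything below is an assumption made for illustration: the random vectors stand in for per-protein pLM embeddings, the embedding dimension (128) and database size are arbitrary, and the hypothetical helper `nearest_neighbors` uses brute-force cosine similarity rather than whatever index the paper actually uses.

```python
import numpy as np


def nearest_neighbors(queries, database, k=3):
    """Brute-force cosine-similarity search (hypothetical helper, not the paper's code).

    Returns, for each query row, the indices of the k most similar
    database rows and the corresponding cosine similarities.
    """
    # L2-normalize rows so that a dot product equals cosine similarity.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    d = database / np.linalg.norm(database, axis=1, keepdims=True)
    sims = q @ d.T  # shape: (n_queries, n_database)
    # Sort each row by descending similarity and keep the top k columns.
    idx = np.argsort(-sims, axis=1)[:, :k]
    top = np.take_along_axis(sims, idx, axis=1)
    return idx, top


# Assumption: random vectors stand in for real per-protein embeddings.
rng = np.random.default_rng(0)
database = rng.normal(size=(1000, 128))

# Queries are slightly perturbed copies of database entries 10, 20 and 30,
# mimicking proteins whose embeddings stay close to those of a known relative.
queries = database[[10, 20, 30]] + 0.1 * rng.normal(size=(3, 128))

idx, sims = nearest_neighbors(queries, database, k=3)
for hits, scores in zip(idx, sims):
    print(hits.tolist(), np.round(scores, 3).tolist())
```

A brute-force scan like this costs time linear in the database size per query; searches over large databases typically replace it with an approximate index, which this sketch does not attempt.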

Bibliographic Details
Main authors: Schütze, Konstantin; Heinzinger, Michael; Steinegger, Martin; Rost, Burkhard
Format: Online Article Text
Language: English
Published: Frontiers Media S.A. 2022
Subjects: Bioinformatics
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9714024/
https://www.ncbi.nlm.nih.gov/pubmed/36466147
http://dx.doi.org/10.3389/fbinf.2022.1033775
author Schütze, Konstantin
Heinzinger, Michael
Steinegger, Martin
Rost, Burkhard
collection PubMed
description Since 1992, all state-of-the-art methods for fast and sensitive identification of evolutionary, structural, and functional relations between proteins (also referred to as “homology detection”) use sequences and sequence-profiles (PSSMs). Protein Language Models (pLMs) generalize sequences, possibly capturing the same constraints as PSSMs, e.g., through embeddings. Here, we explored how to use such embeddings for nearest neighbor searches to identify relations between protein pairs with diverged sequences (remote homology detection for levels of <20% pairwise sequence identity, PIDE). While this approach excelled for proteins with single domains, we demonstrated the current challenges applying this to multi-domain proteins and presented some ideas how to overcome existing limitations, in principle. We observed that sufficiently challenging data set separations were crucial to provide deeply relevant insights into the behavior of nearest neighbor search when applied to the protein embedding space, and made all our methods readily available for others.
format Online
Article
Text
id pubmed-9714024
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Frontiers Media S.A.
record_format MEDLINE/PubMed
spelling pubmed-9714024 2022-12-02 Nearest neighbor search on embeddings rapidly identifies distant protein relations Schütze, Konstantin Heinzinger, Michael Steinegger, Martin Rost, Burkhard Front Bioinform Bioinformatics Frontiers Media S.A. 2022-11-17 /pmc/articles/PMC9714024/ /pubmed/36466147 http://dx.doi.org/10.3389/fbinf.2022.1033775 Text en Copyright © 2022 Schütze, Heinzinger, Steinegger and Rost. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
title Nearest neighbor search on embeddings rapidly identifies distant protein relations
topic Bioinformatics
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9714024/
https://www.ncbi.nlm.nih.gov/pubmed/36466147
http://dx.doi.org/10.3389/fbinf.2022.1033775