Nearest neighbor search on embeddings rapidly identifies distant protein relations
Since 1992, all state-of-the-art methods for fast and sensitive identification of evolutionary, structural, and functional relations between proteins (also referred to as “homology detection”) use sequences and sequence-profiles (PSSMs). Protein Language Models (pLMs) generalize sequences, possibly...
Main authors: | Schütze, Konstantin; Heinzinger, Michael; Steinegger, Martin; Rost, Burkhard |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Frontiers Media S.A. 2022 |
Subjects: | Bioinformatics |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9714024/ https://www.ncbi.nlm.nih.gov/pubmed/36466147 http://dx.doi.org/10.3389/fbinf.2022.1033775 |
_version_ | 1784842134770155520 |
---|---|
author | Schütze, Konstantin Heinzinger, Michael Steinegger, Martin Rost, Burkhard |
author_sort | Schütze, Konstantin |
collection | PubMed |
description | Since 1992, all state-of-the-art methods for fast and sensitive identification of evolutionary, structural, and functional relations between proteins (also referred to as “homology detection”) use sequences and sequence-profiles (PSSMs). Protein Language Models (pLMs) generalize sequences, possibly capturing the same constraints as PSSMs, e.g., through embeddings. Here, we explored how to use such embeddings for nearest neighbor searches to identify relations between protein pairs with diverged sequences (remote homology detection for levels of <20% pairwise sequence identity, PIDE). While this approach excelled for proteins with single domains, we demonstrated the current challenges applying this to multi-domain proteins and presented some ideas how to overcome existing limitations, in principle. We observed that sufficiently challenging data set separations were crucial to provide deeply relevant insights into the behavior of nearest neighbor search when applied to the protein embedding space, and made all our methods readily available for others. |
format | Online Article Text |
id | pubmed-9714024 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Frontiers Media S.A. |
record_format | MEDLINE/PubMed |
spelling | pubmed-9714024 2022-12-02 Nearest neighbor search on embeddings rapidly identifies distant protein relations Schütze, Konstantin Heinzinger, Michael Steinegger, Martin Rost, Burkhard Front Bioinform Bioinformatics Since 1992, all state-of-the-art methods for fast and sensitive identification of evolutionary, structural, and functional relations between proteins (also referred to as “homology detection”) use sequences and sequence-profiles (PSSMs). Protein Language Models (pLMs) generalize sequences, possibly capturing the same constraints as PSSMs, e.g., through embeddings. Here, we explored how to use such embeddings for nearest neighbor searches to identify relations between protein pairs with diverged sequences (remote homology detection for levels of <20% pairwise sequence identity, PIDE). While this approach excelled for proteins with single domains, we demonstrated the current challenges applying this to multi-domain proteins and presented some ideas how to overcome existing limitations, in principle. We observed that sufficiently challenging data set separations were crucial to provide deeply relevant insights into the behavior of nearest neighbor search when applied to the protein embedding space, and made all our methods readily available for others. Frontiers Media S.A. 2022-11-17 /pmc/articles/PMC9714024/ /pubmed/36466147 http://dx.doi.org/10.3389/fbinf.2022.1033775 Text en Copyright © 2022 Schütze, Heinzinger, Steinegger and Rost. https://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms. |
title | Nearest neighbor search on embeddings rapidly identifies distant protein relations |
topic | Bioinformatics |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9714024/ https://www.ncbi.nlm.nih.gov/pubmed/36466147 http://dx.doi.org/10.3389/fbinf.2022.1033775 |