
Contrastive learning on protein embeddings enlightens midnight zone

Experimental structures are leveraged through multiple sequence alignments, or more generally through homology-based inference (HBI), facilitating the transfer of information from a protein with known annotation to a query without any annotation. A recent alternative expands the concept of HBI from...


Bibliographic Details
Main authors: Heinzinger, Michael; Littmann, Maria; Sillitoe, Ian; Bordin, Nicola; Orengo, Christine; Rost, Burkhard
Format: Online Article Text
Language: English
Published: Oxford University Press 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9188115/
https://www.ncbi.nlm.nih.gov/pubmed/35702380
http://dx.doi.org/10.1093/nargab/lqac043
author Heinzinger, Michael
Littmann, Maria
Sillitoe, Ian
Bordin, Nicola
Orengo, Christine
Rost, Burkhard
collection PubMed
description Experimental structures are leveraged through multiple sequence alignments, or more generally through homology-based inference (HBI), facilitating the transfer of information from a protein with known annotation to a query without any annotation. A recent alternative expands the concept of HBI from sequence-distance lookup to embedding-based annotation transfer (EAT). These embeddings are derived from protein Language Models (pLMs). Here, we introduce the use of single protein representations from pLMs for contrastive learning. This learning procedure creates a new set of embeddings that optimizes constraints captured by hierarchical classifications of protein 3D structures defined by the CATH resource. The approach, dubbed ProtTucker, recognizes distant homologous relationships better than more traditional techniques such as threading or fold recognition. Thus, these embeddings have allowed sequence comparison to step into the ‘midnight zone’ of protein similarity, i.e. the region in which distantly related sequences have a seemingly random pairwise sequence similarity. The novelty of this work is in the particular combination of tools and sampling techniques that achieved performance comparable to or better than existing state-of-the-art sequence comparison methods. Additionally, since this method does not need to generate alignments, it is also orders of magnitude faster. The code is available at https://github.com/Rostlab/EAT.
format Online
Article
Text
id pubmed-9188115
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-9188115 2022-06-13 NAR Genom Bioinform Standard Article Oxford University Press 2022-06-11 /pmc/articles/PMC9188115/ /pubmed/35702380 http://dx.doi.org/10.1093/nargab/lqac043 Text en © The Author(s) 2022. Published by Oxford University Press on behalf of NAR Genomics and Bioinformatics.
https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Contrastive learning on protein embeddings enlightens midnight zone
topic Standard Article
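
Illustrative note: the description field above contrasts sequence-distance lookup with embedding-based annotation transfer (EAT) and mentions contrastive learning on pLM embeddings. The following is a minimal sketch of both ideas, not the authors' implementation (that is at the GitHub URL in the description). Everything in it is an assumption made for illustration: the function names, the toy 2-dimensional vectors, the use of Euclidean distance, and the triplet margin objective, which is only one common form of contrastive loss and is not confirmed by the record.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def transfer_annotation(query, lookup):
    """Return (label, distance) of the lookup entry closest to the query.

    `lookup` is a list of (embedding, label) pairs with known annotation.
    This is the nearest-neighbour idea behind annotation transfer:
    the query inherits the label of its closest annotated neighbour.
    """
    best_label, best_dist = None, math.inf
    for embedding, label in lookup:
        dist = euclidean(query, embedding)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label, best_dist


def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    """Hypothetical contrastive objective (assumption, see note above).

    The loss is zero once the anchor is closer to the positive
    (same class) than to the negative (different class) by `margin`.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Toy lookup set: made-up 2-D vectors standing in for protein embeddings,
# labelled with CATH-style four-level identifiers.
toy_lookup = [
    ([0.0, 0.0], "1.10.8.10"),
    ([1.0, 1.0], "3.40.50.300"),
]

label, dist = transfer_annotation([0.9, 1.2], toy_lookup)
print(label, round(dist, 3))
```

In this sketch no alignment is computed at lookup time; each query costs one distance evaluation per annotated entry, which is the property the description credits for the speed advantage.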