
Tree visualizations of protein sequence embedding space enable improved functional clustering of diverse protein superfamilies

Protein language models, trained on millions of biologically observed sequences, generate feature-rich numerical representations of protein sequences. These representations, called sequence embeddings, can infer structure-functional properties, despite protein language models being trained on primar...


Bibliographic Details
Main authors: Yeung, Wayland; Zhou, Zhongliang; Mathew, Liju; Gravel, Nathan; Taujale, Rahil; O’Boyle, Brady; Salcedo, Mariah; Venkat, Aarya; Lanzilotta, William; Li, Sheng; Kannan, Natarajan
Format: Online Article Text
Language: English
Published: Oxford University Press 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9851311/
https://www.ncbi.nlm.nih.gov/pubmed/36642409
http://dx.doi.org/10.1093/bib/bbac619
_version_ 1784872369083383808
author Yeung, Wayland
Zhou, Zhongliang
Mathew, Liju
Gravel, Nathan
Taujale, Rahil
O’Boyle, Brady
Salcedo, Mariah
Venkat, Aarya
Lanzilotta, William
Li, Sheng
Kannan, Natarajan
author_sort Yeung, Wayland
collection PubMed
description Protein language models, trained on millions of biologically observed sequences, generate feature-rich numerical representations of protein sequences. These representations, called sequence embeddings, can infer structure-functional properties, despite protein language models being trained on primary sequence alone. While sequence embeddings have been applied toward tasks such as structure and function prediction, applications toward alignment-free sequence classification have been hindered by the lack of studies to derive, quantify and evaluate relationships between protein sequence embeddings. Here, we develop workflows and visualization methods for the classification of protein families using sequence embedding derived from protein language models. A benchmark of manifold visualization methods reveals that Neighbor Joining (NJ) embedding trees are highly effective in capturing global structure while achieving similar performance in capturing local structure compared with popular dimensionality reduction techniques such as t-SNE and UMAP. The statistical significance of hierarchical clusters on a tree is evaluated by resampling embeddings using a variational autoencoder (VAE). We demonstrate the application of our methods in the classification of two well-studied enzyme superfamilies, phosphatases and protein kinases. Our embedding-based classifications remain consistent with and extend upon previously published sequence alignment-based classifications. We also propose a new hierarchical classification for the S-Adenosyl-L-Methionine (SAM) enzyme superfamily which has been difficult to classify using traditional alignment-based approaches. Beyond applications in sequence classification, our results further suggest NJ trees are a promising general method for visualizing high-dimensional data sets.
format Online
Article
Text
id pubmed-9851311
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-98513112023-01-20 Tree visualizations of protein sequence embedding space enable improved functional clustering of diverse protein superfamilies Yeung, Wayland Zhou, Zhongliang Mathew, Liju Gravel, Nathan Taujale, Rahil O’Boyle, Brady Salcedo, Mariah Venkat, Aarya Lanzilotta, William Li, Sheng Kannan, Natarajan Brief Bioinform Problem Solving Protocol Protein language models, trained on millions of biologically observed sequences, generate feature-rich numerical representations of protein sequences. These representations, called sequence embeddings, can infer structure-functional properties, despite protein language models being trained on primary sequence alone. While sequence embeddings have been applied toward tasks such as structure and function prediction, applications toward alignment-free sequence classification have been hindered by the lack of studies to derive, quantify and evaluate relationships between protein sequence embeddings. Here, we develop workflows and visualization methods for the classification of protein families using sequence embedding derived from protein language models. A benchmark of manifold visualization methods reveals that Neighbor Joining (NJ) embedding trees are highly effective in capturing global structure while achieving similar performance in capturing local structure compared with popular dimensionality reduction techniques such as t-SNE and UMAP. The statistical significance of hierarchical clusters on a tree is evaluated by resampling embeddings using a variational autoencoder (VAE). We demonstrate the application of our methods in the classification of two well-studied enzyme superfamilies, phosphatases and protein kinases. Our embedding-based classifications remain consistent with and extend upon previously published sequence alignment-based classifications. 
We also propose a new hierarchical classification for the S-Adenosyl-L-Methionine (SAM) enzyme superfamily which has been difficult to classify using traditional alignment-based approaches. Beyond applications in sequence classification, our results further suggest NJ trees are a promising general method for visualizing high-dimensional data sets. Oxford University Press 2023-01-15 /pmc/articles/PMC9851311/ /pubmed/36642409 http://dx.doi.org/10.1093/bib/bbac619 Text en © The Author(s) 2023. Published by Oxford University Press. https://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
spellingShingle Problem Solving Protocol
Yeung, Wayland
Zhou, Zhongliang
Mathew, Liju
Gravel, Nathan
Taujale, Rahil
O’Boyle, Brady
Salcedo, Mariah
Venkat, Aarya
Lanzilotta, William
Li, Sheng
Kannan, Natarajan
Tree visualizations of protein sequence embedding space enable improved functional clustering of diverse protein superfamilies
title Tree visualizations of protein sequence embedding space enable improved functional clustering of diverse protein superfamilies
topic Problem Solving Protocol
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9851311/
https://www.ncbi.nlm.nih.gov/pubmed/36642409
http://dx.doi.org/10.1093/bib/bbac619
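
The description above reports that Neighbor Joining (NJ) trees built from protein sequence embeddings capture global structure well. As a rough illustration only, and not the authors' code, the following Python sketch runs the standard NJ algorithm on a pairwise distance matrix. The function names (`pairwise_euclidean`, `neighbor_joining`, `embedding_tree`), the use of Euclidean distance, and the random toy embeddings are assumptions made for this example; this record does not state which distance measure or software the authors used.

```python
import numpy as np


def pairwise_euclidean(X):
    """Return the (n, n) Euclidean distance matrix for the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def neighbor_joining(D, labels):
    """Build an NJ tree from a symmetric distance matrix; return Newick text."""
    D = np.array(D, dtype=float)
    nodes = list(labels)
    while len(nodes) > 2:
        n = len(nodes)
        r = D.sum(axis=1)
        # Q-matrix: the pair with the smallest entry is joined next.
        Q = (n - 2) * D - r[:, None] - r[None, :]
        np.fill_diagonal(Q, np.inf)
        i, j = np.unravel_index(np.argmin(Q), Q.shape)
        if i > j:
            i, j = j, i
        # Branch lengths from the two joined nodes to their new parent.
        li = 0.5 * D[i, j] + (r[i] - r[j]) / (2 * (n - 2))
        lj = D[i, j] - li
        # Distances from the new parent to every node (indices i, j dropped below).
        d_new = 0.5 * (D[i] + D[j] - D[i, j])
        parent = f"({nodes[i]}:{li:.4f},{nodes[j]}:{lj:.4f})"
        keep = [k for k in range(n) if k not in (i, j)]
        D_next = np.zeros((n - 1, n - 1))
        D_next[:-1, :-1] = D[np.ix_(keep, keep)]
        D_next[-1, :-1] = d_new[keep]
        D_next[:-1, -1] = d_new[keep]
        D = D_next
        nodes = [nodes[k] for k in keep] + [parent]
    # Two nodes remain: connect them with a single edge.
    return f"({nodes[0]},{nodes[1]}:{D[0, 1]:.4f});"


def embedding_tree(embeddings, labels):
    """Hypothetical helper: NJ tree from an (n_sequences, dim) embedding array."""
    return neighbor_joining(pairwise_euclidean(np.asarray(embeddings)), labels)


# Small additive five-taxon distance matrix, used as a sanity check.
toy_D = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
toy_newick = neighbor_joining(toy_D, list("abcde"))

# Random stand-in "embeddings" (assumption: real ones would come from a
# protein language model, one vector per sequence).
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(6, 16))
fake_labels = [f"seq{k}" for k in range(6)]
fake_newick = embedding_tree(fake_embeddings, fake_labels)

if __name__ == "__main__":
    print(toy_newick)
    print(fake_newick)
```

On the toy matrix, the sketch joins `a` with `b` (branch lengths 2 and 3) and `d` with `e` (branch lengths 2 and 1). On real embedding distances, which are generally not additive, NJ can produce negative branch lengths, and the resulting Newick string would normally be passed to a tree viewer rather than read directly.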