Tree visualizations of protein sequence embedding space enable improved functional clustering of diverse protein superfamilies
Protein language models, trained on millions of biologically observed sequences, generate feature-rich numerical representations of protein sequences. These representations, called sequence embeddings, can infer structure-functional properties, despite protein language models being trained on primar...
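Main authors: | Yeung, Wayland; Zhou, Zhongliang; Mathew, Liju; Gravel, Nathan; Taujale, Rahil; O’Boyle, Brady; Salcedo, Mariah; Venkat, Aarya; Lanzilotta, William; Li, Sheng; Kannan, Natarajan |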
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press, 2023 |
Subjects: | Problem Solving Protocol |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9851311/ https://www.ncbi.nlm.nih.gov/pubmed/36642409 http://dx.doi.org/10.1093/bib/bbac619 |
_version_ | 1784872369083383808 |
---|---|
author | Yeung, Wayland; Zhou, Zhongliang; Mathew, Liju; Gravel, Nathan; Taujale, Rahil; O’Boyle, Brady; Salcedo, Mariah; Venkat, Aarya; Lanzilotta, William; Li, Sheng; Kannan, Natarajan |
author_facet | Yeung, Wayland; Zhou, Zhongliang; Mathew, Liju; Gravel, Nathan; Taujale, Rahil; O’Boyle, Brady; Salcedo, Mariah; Venkat, Aarya; Lanzilotta, William; Li, Sheng; Kannan, Natarajan |
author_sort | Yeung, Wayland |
collection | PubMed |
description | Protein language models, trained on millions of biologically observed sequences, generate feature-rich numerical representations of protein sequences. These representations, called sequence embeddings, can infer structure-functional properties, despite protein language models being trained on primary sequence alone. While sequence embeddings have been applied toward tasks such as structure and function prediction, applications toward alignment-free sequence classification have been hindered by the lack of studies to derive, quantify and evaluate relationships between protein sequence embeddings. Here, we develop workflows and visualization methods for the classification of protein families using sequence embedding derived from protein language models. A benchmark of manifold visualization methods reveals that Neighbor Joining (NJ) embedding trees are highly effective in capturing global structure while achieving similar performance in capturing local structure compared with popular dimensionality reduction techniques such as t-SNE and UMAP. The statistical significance of hierarchical clusters on a tree is evaluated by resampling embeddings using a variational autoencoder (VAE). We demonstrate the application of our methods in the classification of two well-studied enzyme superfamilies, phosphatases and protein kinases. Our embedding-based classifications remain consistent with and extend upon previously published sequence alignment-based classifications. We also propose a new hierarchical classification for the S-Adenosyl-L-Methionine (SAM) enzyme superfamily which has been difficult to classify using traditional alignment-based approaches. Beyond applications in sequence classification, our results further suggest NJ trees are a promising general method for visualizing high-dimensional data sets. |
format | Online Article Text |
id | pubmed-9851311 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-9851311 2023-01-20 Tree visualizations of protein sequence embedding space enable improved functional clustering of diverse protein superfamilies Yeung, Wayland; Zhou, Zhongliang; Mathew, Liju; Gravel, Nathan; Taujale, Rahil; O’Boyle, Brady; Salcedo, Mariah; Venkat, Aarya; Lanzilotta, William; Li, Sheng; Kannan, Natarajan Brief Bioinform Problem Solving Protocol Protein language models, trained on millions of biologically observed sequences, generate feature-rich numerical representations of protein sequences. These representations, called sequence embeddings, can infer structure-functional properties, despite protein language models being trained on primary sequence alone. While sequence embeddings have been applied toward tasks such as structure and function prediction, applications toward alignment-free sequence classification have been hindered by the lack of studies to derive, quantify and evaluate relationships between protein sequence embeddings. Here, we develop workflows and visualization methods for the classification of protein families using sequence embedding derived from protein language models. A benchmark of manifold visualization methods reveals that Neighbor Joining (NJ) embedding trees are highly effective in capturing global structure while achieving similar performance in capturing local structure compared with popular dimensionality reduction techniques such as t-SNE and UMAP. The statistical significance of hierarchical clusters on a tree is evaluated by resampling embeddings using a variational autoencoder (VAE). We demonstrate the application of our methods in the classification of two well-studied enzyme superfamilies, phosphatases and protein kinases. Our embedding-based classifications remain consistent with and extend upon previously published sequence alignment-based classifications. We also propose a new hierarchical classification for the S-Adenosyl-L-Methionine (SAM) enzyme superfamily which has been difficult to classify using traditional alignment-based approaches. Beyond applications in sequence classification, our results further suggest NJ trees are a promising general method for visualizing high-dimensional data sets. Oxford University Press 2023-01-15 /pmc/articles/PMC9851311/ /pubmed/36642409 http://dx.doi.org/10.1093/bib/bbac619 Text en © The Author(s) 2023. Published by Oxford University Press. https://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com |
spellingShingle | Problem Solving Protocol Yeung, Wayland Zhou, Zhongliang Mathew, Liju Gravel, Nathan Taujale, Rahil O’Boyle, Brady Salcedo, Mariah Venkat, Aarya Lanzilotta, William Li, Sheng Kannan, Natarajan Tree visualizations of protein sequence embedding space enable improved functional clustering of diverse protein superfamilies |
title | Tree visualizations of protein sequence embedding space enable improved functional clustering of diverse protein superfamilies |
title_sort | tree visualizations of protein sequence embedding space enable improved functional clustering of diverse protein superfamilies |
topic | Problem Solving Protocol |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9851311/ https://www.ncbi.nlm.nih.gov/pubmed/36642409 http://dx.doi.org/10.1093/bib/bbac619 |