
Unsupervised Deep Learning Can Identify Protein Functional Groups from Unaligned Sequences

Interpreting protein function from sequence data is a fundamental goal of bioinformatics. However, our current understanding of protein diversity is bottlenecked by the fact that most proteins have only been functionally validated in model organisms, limiting our understanding of how function varies...


Bibliographic Details
Main authors: David, Kyle T, Halanych, Kenneth M
Format: Online Article Text
Language: English
Published: Oxford University Press 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10231473/
https://www.ncbi.nlm.nih.gov/pubmed/37217837
http://dx.doi.org/10.1093/gbe/evad084
author David, Kyle T
Halanych, Kenneth M
collection PubMed
description Interpreting protein function from sequence data is a fundamental goal of bioinformatics. However, our current understanding of protein diversity is bottlenecked by the fact that most proteins have only been functionally validated in model organisms, limiting our understanding of how function varies with gene sequence diversity. Thus, accuracy of inferences in clades without model representatives is questionable. Unsupervised learning may help to ameliorate this bias by identifying highly complex patterns and structure from large data sets without external labels. Here, we present DeepSeqProt, an unsupervised deep learning program for exploring large protein sequence data sets. DeepSeqProt is a clustering tool capable of distinguishing between broad classes of proteins while learning local and global structure of functional space. DeepSeqProt is capable of learning salient biological features from unaligned, unannotated sequences. DeepSeqProt is more likely to capture complete protein families and statistically significant shared ontologies within proteomes than other clustering methods. We hope this framework will prove of use to researchers and provide a preliminary step in further developing unsupervised deep learning in molecular biology.
format Online
Article
Text
id pubmed-10231473
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-10231473 2023-06-01
Unsupervised Deep Learning Can Identify Protein Functional Groups from Unaligned Sequences
David, Kyle T
Halanych, Kenneth M
Genome Biol Evol
Article
Oxford University Press 2023-05-22
/pmc/articles/PMC10231473/ /pubmed/37217837 http://dx.doi.org/10.1093/gbe/evad084
Text en
© The Author(s) 2023. Published by Oxford University Press on behalf of Society for Molecular Biology and Evolution.
This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title Unsupervised Deep Learning Can Identify Protein Functional Groups from Unaligned Sequences
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10231473/
https://www.ncbi.nlm.nih.gov/pubmed/37217837
http://dx.doi.org/10.1093/gbe/evad084