Generative models for protein sequence modeling: recent advances and future directions
The widespread adoption of high-throughput omics technologies has exponentially increased the amount of protein sequence data involved in many salient disease pathways and their respective therapeutics and diagnostics. Despite the availability of large-scale sequence data, the lack of experimental f...
Main authors: | Mardikoraem, Mehrsa; Wang, Zirui; Pascual, Nathaniel; Woldring, Daniel |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2023 |
Subjects: | Review |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10589401/ https://www.ncbi.nlm.nih.gov/pubmed/37864295 http://dx.doi.org/10.1093/bib/bbad358 |
_version_ | 1785123782387564544 |
---|---|
author | Mardikoraem, Mehrsa Wang, Zirui Pascual, Nathaniel Woldring, Daniel |
collection | PubMed |
description | The widespread adoption of high-throughput omics technologies has exponentially increased the amount of protein sequence data involved in many salient disease pathways and their respective therapeutics and diagnostics. Despite the availability of large-scale sequence data, the lack of experimental fitness annotations underpins the need for self-supervised and unsupervised machine learning (ML) methods. These techniques leverage the meaningful features encoded in abundant unlabeled sequences to accomplish complex protein engineering tasks. Proficiency in the rapidly evolving fields of protein engineering and generative AI is required to realize the full potential of ML models as a tool for protein fitness landscape navigation. Here, we support this work by (i) providing an overview of the architecture and mathematical details of the most successful ML models applicable to sequence data (e.g. variational autoencoders, autoregressive models, generative adversarial neural networks, and diffusion models), (ii) guiding how to effectively implement these models on protein sequence data to predict fitness or generate high-fitness sequences and (iii) highlighting several successful studies that implement these techniques in protein engineering (from paratope regions and subcellular localization prediction to high-fitness sequences and protein design rules generation). By providing a comprehensive survey of model details, novel architecture developments, comparisons of model applications, and current challenges, this study intends to provide structured guidance and robust framework for delivering a prospective outlook in the ML-driven protein engineering field. |
format | Online Article Text |
id | pubmed-10589401 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-105894012023-10-22 Generative models for protein sequence modeling: recent advances and future directions Mardikoraem, Mehrsa Wang, Zirui Pascual, Nathaniel Woldring, Daniel Brief Bioinform Review The widespread adoption of high-throughput omics technologies has exponentially increased the amount of protein sequence data involved in many salient disease pathways and their respective therapeutics and diagnostics. Despite the availability of large-scale sequence data, the lack of experimental fitness annotations underpins the need for self-supervised and unsupervised machine learning (ML) methods. These techniques leverage the meaningful features encoded in abundant unlabeled sequences to accomplish complex protein engineering tasks. Proficiency in the rapidly evolving fields of protein engineering and generative AI is required to realize the full potential of ML models as a tool for protein fitness landscape navigation. Here, we support this work by (i) providing an overview of the architecture and mathematical details of the most successful ML models applicable to sequence data (e.g. variational autoencoders, autoregressive models, generative adversarial neural networks, and diffusion models), (ii) guiding how to effectively implement these models on protein sequence data to predict fitness or generate high-fitness sequences and (iii) highlighting several successful studies that implement these techniques in protein engineering (from paratope regions and subcellular localization prediction to high-fitness sequences and protein design rules generation). By providing a comprehensive survey of model details, novel architecture developments, comparisons of model applications, and current challenges, this study intends to provide structured guidance and robust framework for delivering a prospective outlook in the ML-driven protein engineering field. 
Oxford University Press 2023-10-20 /pmc/articles/PMC10589401/ /pubmed/37864295 http://dx.doi.org/10.1093/bib/bbad358 Text en © The Author(s) 2023. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
title | Generative models for protein sequence modeling: recent advances and future directions |
topic | Review |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10589401/ https://www.ncbi.nlm.nih.gov/pubmed/37864295 http://dx.doi.org/10.1093/bib/bbad358 |