Searching for protein variants with desired properties using deep generative models
BACKGROUND: Protein engineering aims to improve the functional properties of existing proteins to meet people’s needs. Current deep learning-based models have captured evolutionary, functional, and biochemical features contained in amino acid sequences. However, the existing generative models need t...
Main Authors: | Li, Yan, Yao, Yinying, Xia, Yu, Tang, Mingjing |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2023 |
Subjects: | Research |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10362698/ https://www.ncbi.nlm.nih.gov/pubmed/37480001 http://dx.doi.org/10.1186/s12859-023-05415-9 |
_version_ | 1785076484081188864 |
---|---|
author | Li, Yan Yao, Yinying Xia, Yu Tang, Mingjing |
author_facet | Li, Yan Yao, Yinying Xia, Yu Tang, Mingjing |
author_sort | Li, Yan |
collection | PubMed |
description | BACKGROUND: Protein engineering aims to improve the functional properties of existing proteins to meet people’s needs. Current deep learning-based models have captured evolutionary, functional, and biochemical features contained in amino acid sequences. However, the existing generative models need to be improved when capturing the relationship between amino acid sites on longer sequences. At the same time, the distribution of protein sequences in the homologous family has a specific positional relationship in the latent space. We want to use this relationship to search for new variants directly from the vicinity of better-performing variants. RESULTS: To improve the representation learning ability of the model for longer sequences and the similarity between the generated sequences and the original sequences, we propose a temporal variational autoencoder (T-VAE) model. T-VAE consists of an encoder and a decoder. The encoder expands the receptive field of neurons in the network structure by dilated causal convolution, thereby improving the encoding representation ability of longer sequences. The decoder decodes the sampled data into variants closely resembling the original sequence. CONCLUSION: Compared to other models, the Pearson correlation coefficient between the protein fitness values predicted by T-VAE and the true values was higher, and the mean absolute deviation was lower. In addition, the T-VAE model has a better representation learning ability for longer sequences when comparing the encoding of protein sequences of different lengths. These results show that our model has more advantages in representation learning for longer sequences. To verify the model’s generative effect, we also calculate the sequence identity between the generated data and the input data. The sequence identity obtained by T-VAE improved by 12.9% compared to the baseline model. |
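The dilated causal convolution the abstract credits for the encoder's larger receptive field can be illustrated with a minimal sketch. This is a hypothetical NumPy implementation for intuition only, not the authors' code: output at position t sees only inputs at t, t-d, t-2d, ..., and stacking layers with dilations 1, 2, 4, ... grows the receptive field exponentially with depth.

```python
import numpy as np

def dilated_causal_conv1d(x, w, dilation):
    """1-D dilated causal convolution over a sequence x with kernel w.

    Output at position t depends only on x[t], x[t-d], x[t-2d], ...
    (the past); the sequence is left-padded with zeros so the output
    has the same length as the input.
    """
    k = len(w)
    pad = (k - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])
    return np.array([
        sum(w[j] * xp[t + pad - j * dilation] for j in range(k))
        for t in range(len(x))
    ])

# With kernel size k and layer dilations 1, 2, 4, ..., 2^(L-1), the
# receptive field after L layers is 1 + (k-1) * (2^L - 1) positions,
# which is why a stack of such layers can relate distant amino acid
# sites on long sequences.
x = np.arange(8, dtype=float)
w = np.array([1.0, 1.0])           # kernel size 2: sum of two taps
y = dilated_causal_conv1d(x, w, dilation=2)   # y[t] = x[t] + x[t-2]
```

Each output here is the sum of the current input and the one two steps back, so changing a future input never changes an earlier output, which is the causality property the encoder relies on.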
format | Online Article Text |
id | pubmed-10362698 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-103626982023-07-23 Searching for protein variants with desired properties using deep generative models Li, Yan Yao, Yinying Xia, Yu Tang, Mingjing BMC Bioinformatics Research BACKGROUND: Protein engineering aims to improve the functional properties of existing proteins to meet people’s needs. Current deep learning-based models have captured evolutionary, functional, and biochemical features contained in amino acid sequences. However, the existing generative models need to be improved when capturing the relationship between amino acid sites on longer sequences. At the same time, the distribution of protein sequences in the homologous family has a specific positional relationship in the latent space. We want to use this relationship to search for new variants directly from the vicinity of better-performing variants. RESULTS: To improve the representation learning ability of the model for longer sequences and the similarity between the generated sequences and the original sequences, we propose a temporal variational autoencoder (T-VAE) model. T-VAE consists of an encoder and a decoder. The encoder expands the receptive field of neurons in the network structure by dilated causal convolution, thereby improving the encoding representation ability of longer sequences. The decoder decodes the sampled data into variants closely resembling the original sequence. CONCLUSION: Compared to other models, the Pearson correlation coefficient between the protein fitness values predicted by T-VAE and the true values was higher, and the mean absolute deviation was lower. In addition, the T-VAE model has a better representation learning ability for longer sequences when comparing the encoding of protein sequences of different lengths. These results show that our model has more advantages in representation learning for longer sequences. 
To verify the model’s generative effect, we also calculate the sequence identity between the generated data and the input data. The sequence identity obtained by T-VAE improved by 12.9% compared to the baseline model. BioMed Central 2023-07-21 /pmc/articles/PMC10362698/ /pubmed/37480001 http://dx.doi.org/10.1186/s12859-023-05415-9 Text en © The Author(s) 2023 https://creativecommons.org/licenses/by/4.0/Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ (https://creativecommons.org/licenses/by/4.0/) . The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/ (https://creativecommons.org/publicdomain/zero/1.0/) ) applies to the data made available in this article, unless otherwise stated in a credit line to the data. |
spellingShingle | Research Li, Yan Yao, Yinying Xia, Yu Tang, Mingjing Searching for protein variants with desired properties using deep generative models |
title | Searching for protein variants with desired properties using deep generative models |
title_full | Searching for protein variants with desired properties using deep generative models |
title_fullStr | Searching for protein variants with desired properties using deep generative models |
title_full_unstemmed | Searching for protein variants with desired properties using deep generative models |
title_short | Searching for protein variants with desired properties using deep generative models |
title_sort | searching for protein variants with desired properties using deep generative models |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10362698/ https://www.ncbi.nlm.nih.gov/pubmed/37480001 http://dx.doi.org/10.1186/s12859-023-05415-9 |
work_keys_str_mv | AT liyan searchingforproteinvariantswithdesiredpropertiesusingdeepgenerativemodels AT yaoyinying searchingforproteinvariantswithdesiredpropertiesusingdeepgenerativemodels AT xiayu searchingforproteinvariantswithdesiredpropertiesusingdeepgenerativemodels AT tangmingjing searchingforproteinvariantswithdesiredpropertiesusingdeepgenerativemodels |