
Searching for protein variants with desired properties using deep generative models


Bibliographic Details
Main Authors: Li, Yan, Yao, Yinying, Xia, Yu, Tang, Mingjing
Format: Online Article Text
Language: English
Published: BioMed Central 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10362698/
https://www.ncbi.nlm.nih.gov/pubmed/37480001
http://dx.doi.org/10.1186/s12859-023-05415-9
_version_ 1785076484081188864
author Li, Yan
Yao, Yinying
Xia, Yu
Tang, Mingjing
author_facet Li, Yan
Yao, Yinying
Xia, Yu
Tang, Mingjing
author_sort Li, Yan
collection PubMed
description BACKGROUND: Protein engineering aims to improve the functional properties of existing proteins to meet practical needs. Current deep learning-based models have captured the evolutionary, functional, and biochemical features contained in amino acid sequences. However, existing generative models still struggle to capture the relationships between amino acid sites in longer sequences. At the same time, the sequences of a homologous protein family occupy specific positional relationships in the latent space, and we want to exploit these relationships to search for new variants directly in the vicinity of better-performing variants. RESULTS: To improve the model's representation learning for longer sequences and the similarity between the generated sequences and the original sequences, we propose a temporal variational autoencoder (T-VAE) model. T-VAE consists of an encoder and a decoder. The encoder expands the receptive field of neurons in the network through dilated causal convolution, thereby improving its ability to encode longer sequences. The decoder decodes sampled latent data into variants that closely resemble the original sequence. CONCLUSION: Compared to other models, the Pearson correlation coefficient between the protein fitness values predicted by T-VAE and the true values was higher, and the mean absolute deviation was lower. In addition, when comparing the encoding of protein sequences of different lengths, T-VAE showed better representation learning for longer sequences. These results show that our model has an advantage in representation learning for longer sequences. To verify the model's generative performance, we also calculated the sequence identity between the generated data and the input data; the sequence identity obtained by T-VAE improved by 12.9% over the baseline model.
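The abstract's central architectural idea is an encoder built from dilated causal convolutions, whose receptive field grows exponentially with depth so that distant amino acid positions can contribute to the encoding of long sequences. The sketch below illustrates that idea in PyTorch; it is not the authors' implementation, and the alphabet size, sequence length, latent dimensionality, layer widths, and the simple MLP decoder are illustrative assumptions rather than values from the paper.

```python
# Minimal sketch (assumptions, not the paper's code) of a VAE whose encoder
# stacks dilated causal 1-D convolutions over one-hot amino acid sequences.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_AA = 21          # 20 amino acids + gap/pad symbol (assumed alphabet)
SEQ_LEN = 512        # illustrative maximum sequence length
LATENT_DIM = 32      # illustrative latent dimensionality


class CausalConv1d(nn.Module):
    """1-D convolution padded only on the left, so position t never sees t+1."""

    def __init__(self, in_ch, out_ch, kernel_size, dilation):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, dilation=dilation)

    def forward(self, x):                      # x: (batch, channels, length)
        x = F.pad(x, (self.pad, 0))            # left-pad only -> causal
        return self.conv(x)


class TVAESketch(nn.Module):
    """Encoder: stacked dilated causal convs -> mean/log-variance of q(z|x).
    Decoder: a simple MLP mapping a latent sample back to per-position
    amino acid logits (a stand-in for whatever decoder the paper uses)."""

    def __init__(self):
        super().__init__()
        dilations = [1, 2, 4, 8]               # receptive field doubles per layer
        layers, ch = [], NUM_AA
        for d in dilations:
            layers += [CausalConv1d(ch, 64, kernel_size=3, dilation=d), nn.ReLU()]
            ch = 64
        self.encoder = nn.Sequential(*layers)
        self.to_mu = nn.Linear(64 * SEQ_LEN, LATENT_DIM)
        self.to_logvar = nn.Linear(64 * SEQ_LEN, LATENT_DIM)
        self.decoder = nn.Sequential(
            nn.Linear(LATENT_DIM, 256), nn.ReLU(),
            nn.Linear(256, SEQ_LEN * NUM_AA),
        )

    def forward(self, x):                      # x: (batch, NUM_AA, SEQ_LEN), one-hot
        h = self.encoder(x).flatten(1)
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # reparameterisation
        logits = self.decoder(z).view(-1, NUM_AA, SEQ_LEN)
        return logits, mu, logvar


if __name__ == "__main__":
    x = F.one_hot(torch.randint(0, NUM_AA, (2, SEQ_LEN)), NUM_AA).float().transpose(1, 2)
    logits, mu, logvar = TVAESketch()(x)
    print(logits.shape, mu.shape)              # torch.Size([2, 21, 512]) torch.Size([2, 32])
```

With kernel size 3 and dilations 1, 2, 4, and 8, each output position in this sketch sees 1 + 2·(1 + 2 + 4 + 8) = 31 preceding positions; adding layers or larger dilations widens the receptive field further without enlarging the kernel, which is the property the abstract credits for better encoding of longer sequences.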
format Online
Article
Text
id pubmed-10362698
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-103626982023-07-23 Searching for protein variants with desired properties using deep generative models Li, Yan Yao, Yinying Xia, Yu Tang, Mingjing BMC Bioinformatics Research BioMed Central 2023-07-21 /pmc/articles/PMC10362698/ /pubmed/37480001 http://dx.doi.org/10.1186/s12859-023-05415-9 Text en © The Author(s) 2023. This article is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/); the Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
spellingShingle Research
Li, Yan
Yao, Yinying
Xia, Yu
Tang, Mingjing
Searching for protein variants with desired properties using deep generative models
title Searching for protein variants with desired properties using deep generative models
title_full Searching for protein variants with desired properties using deep generative models
title_fullStr Searching for protein variants with desired properties using deep generative models
title_full_unstemmed Searching for protein variants with desired properties using deep generative models
title_short Searching for protein variants with desired properties using deep generative models
title_sort searching for protein variants with desired properties using deep generative models
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10362698/
https://www.ncbi.nlm.nih.gov/pubmed/37480001
http://dx.doi.org/10.1186/s12859-023-05415-9
work_keys_str_mv AT liyan searchingforproteinvariantswithdesiredpropertiesusingdeepgenerativemodels
AT yaoyinying searchingforproteinvariantswithdesiredpropertiesusingdeepgenerativemodels
AT xiayu searchingforproteinvariantswithdesiredpropertiesusingdeepgenerativemodels
AT tangmingjing searchingforproteinvariantswithdesiredpropertiesusingdeepgenerativemodels