
Nucleotide augmentation for machine learning-guided protein engineering

SUMMARY: Machine learning-guided protein engineering is a rapidly advancing field. Despite major experimental and computational advances, collecting protein genotype (sequence) and phenotype (function) data remains time- and resource-intensive. As a result, the quality and quantity of training data...


Bibliographic Details
Main authors:	Minot, Mason; Reddy, Sai T
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2022
Subjects:	Original Paper
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9843584/ https://www.ncbi.nlm.nih.gov/pubmed/36698759 http://dx.doi.org/10.1093/bioadv/vbac094

author	Minot, Mason; Reddy, Sai T
collection	PubMed
description	SUMMARY: Machine learning-guided protein engineering is a rapidly advancing field. Despite major experimental and computational advances, collecting protein genotype (sequence) and phenotype (function) data remains time- and resource-intensive. As a result, the quality and quantity of training data are often a limiting factor in developing machine learning models. Data augmentation techniques have been successfully applied to the fields of computer vision and natural language processing; however, there is a lack of such augmentation techniques for biological sequence data. Towards this end, we develop nucleotide augmentation (NTA), which leverages natural nucleotide codon degeneracy to augment protein sequence data via synonymous codon substitution. As a proof of concept for protein engineering, we test several online and offline augmentation implementations to train machine learning models with benchmark datasets of protein genotype and phenotype, revealing performance gains on par and surpassing benchmark models using a fraction of the training data. NTA also enables substantial improvements for classification tasks under heavy class imbalance. AVAILABILITY AND IMPLEMENTATION: The code used in this study is publicly available at https://github.com/minotm/NTA SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics Advances online.
format	Online Article Text
id	pubmed-9843584
institution	National Center for Biotechnology Information
language	English
publishDate	2022
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-9843584 2023-01-24 Nucleotide augmentation for machine learning-guided protein engineering Minot, Mason Reddy, Sai T Bioinform Adv Original Paper Oxford University Press 2022-12-09 /pmc/articles/PMC9843584/ /pubmed/36698759 http://dx.doi.org/10.1093/bioadv/vbac094 Text en © The Author(s) 2022. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title	Nucleotide augmentation for machine learning-guided protein engineering
topic	Original Paper
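
Illustrative sketch (not part of the catalog record): the description field says NTA augments protein sequence data through synonymous codon substitution, exploiting codon degeneracy. The Python below is a minimal sketch of that idea only. It assumes the standard genetic code and uniform random codon choice; the names translate and augment_protein are hypothetical and are not taken from the authors' repository, whose implementation may differ.

```python
import random
from collections import defaultdict
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption:
# the standard table is the relevant one for the data at hand).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TO_AA = {
    "".join(triplet): aa for triplet, aa in zip(product(BASES, repeat=3), AMINO)
}

# Invert the table: amino acid -> list of synonymous codons.
AA_TO_CODONS = defaultdict(list)
for codon, aa in CODON_TO_AA.items():
    AA_TO_CODONS[aa].append(codon)


def translate(nt_seq):
    """Translate a nucleotide sequence (length divisible by 3) to protein."""
    if len(nt_seq) % 3 != 0:
        raise ValueError("nucleotide sequence length must be a multiple of 3")
    return "".join(CODON_TO_AA[nt_seq[i:i + 3]] for i in range(0, len(nt_seq), 3))


def augment_protein(protein, n_aug, seed=0):
    """Return up to n_aug distinct nucleotide encodings of one protein.

    Hypothetical helper: each amino acid is replaced by a uniformly
    sampled synonymous codon, so every variant translates back to the
    same protein and can carry the same phenotype label.
    """
    rng = random.Random(seed)
    variants = set()
    tries = 0
    # Bounded loop: short or low-degeneracy proteins may have fewer
    # than n_aug distinct encodings.
    while len(variants) < n_aug and tries < 100 * n_aug:
        variants.add("".join(rng.choice(AA_TO_CODONS[aa]) for aa in protein))
        tries += 1
    return sorted(variants)


if __name__ == "__main__":
    for nt in augment_protein("MKTLLV", n_aug=3, seed=1):
        print(nt, "->", translate(nt))
```

Each generated nucleotide string is a distinct training example with an unchanged label, which is how a small labeled set, or a minority class, could be expanded before model training.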


Similar items