ProtPlat: an efficient pre-training platform for protein classification based on FastText

| Main Authors | Jin, Yuan; Yang, Yang |
| --- | --- |
| Format | Online Article Text |
| Language | English |
| Published | BioMed Central, 2022 |
| Subjects | Research |
| Online Access | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8832758/ https://www.ncbi.nlm.nih.gov/pubmed/35148686 http://dx.doi.org/10.1186/s12859-022-04604-2 |

| Field | Value |
| --- | --- |
| _version_ | 1784648785721294848 |
| author | Jin, Yuan; Yang, Yang |
| author_facet | Jin, Yuan; Yang, Yang |
| author_sort | Jin, Yuan |
| collection | PubMed |
| description | BACKGROUND: Over the past decades, benefiting from the rapid growth of protein sequence data in public databases, many machine learning methods have been developed to predict the physicochemical properties or functions of proteins from amino acid sequence features. However, prediction performance often suffers from a lack of labeled data. In recent years, pre-training methods have been widely studied to address the small-sample issue in computer vision and natural language processing, whereas pre-training techniques specific to protein sequences remain few. RESULTS: In this paper, we propose ProtPlat, a pre-training platform for representing protein sequences, which uses the Pfam database to train a three-layer neural network and then fine-tunes the model with task-specific training data from downstream tasks. ProtPlat learns informative representations for amino acids while achieving efficient classification. We conduct experiments on three protein classification tasks: the identification of type III secreted effectors, the prediction of subcellular localization, and the recognition of signal peptides. The experimental results show that pre-training effectively enhances model performance and that ProtPlat is competitive with state-of-the-art predictors, especially on small datasets. We have implemented the ProtPlat platform as a publicly accessible web service (https://compbio.sjtu.edu.cn/protplat). CONCLUSIONS: To enhance the feature representation of protein amino acid sequences and improve the performance of sequence-based classification tasks, we develop ProtPlat, a general platform for the pre-training of protein sequences, featuring large-scale supervised training on the Pfam database and an efficient learning model, FastText. The results on three downstream classification tasks demonstrate the efficacy of ProtPlat. (A minimal code sketch of this pre-train/fine-tune pipeline is given after this record.) SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s12859-022-04604-2. |
| format | Online Article Text |
| id | pubmed-8832758 |
| institution | National Center for Biotechnology Information |
| language | English |
| publishDate | 2022 |
| publisher | BioMed Central |
| record_format | MEDLINE/PubMed |
| spelling | pubmed-8832758, 2022-02-15. ProtPlat: an efficient pre-training platform for protein classification based on FastText. Jin, Yuan; Yang, Yang. BMC Bioinformatics (Research). BioMed Central, 2022-02-11. /pmc/articles/PMC8832758/ /pubmed/35148686 http://dx.doi.org/10.1186/s12859-022-04604-2. Text, en. © The Author(s) 2022. Open Access under the Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/). |
| spellingShingle | Research; Jin, Yuan; Yang, Yang; ProtPlat: an efficient pre-training platform for protein classification based on FastText |
| title | ProtPlat: an efficient pre-training platform for protein classification based on FastText |
| title_full | ProtPlat: an efficient pre-training platform for protein classification based on FastText |
| title_fullStr | ProtPlat: an efficient pre-training platform for protein classification based on FastText |
| title_full_unstemmed | ProtPlat: an efficient pre-training platform for protein classification based on FastText |
| title_short | ProtPlat: an efficient pre-training platform for protein classification based on FastText |
| title_sort | protplat: an efficient pre-training platform for protein classification based on fasttext |
| topic | Research |
| url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8832758/ https://www.ncbi.nlm.nih.gov/pubmed/35148686 http://dx.doi.org/10.1186/s12859-022-04604-2 |
| work_keys_str_mv | AT jinyuan protplatanefficientpretrainingplatformforproteinclassificationbasedonfasttext; AT yangyang protplatanefficientpretrainingplatformforproteinclassificationbasedonfasttext |
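
The description above outlines ProtPlat's two-stage scheme: supervised FastText-style pre-training on Pfam family labels, followed by fine-tuning on a downstream classification task. The sketch below illustrates one way such a pipeline could be realized with the open-source fasttext Python package; the 3-mer tokenization, the file names (pfam_train.txt, signal_peptide_train.txt, protplat.vec), and all hyperparameter values are illustrative assumptions, not settings taken from the paper.

```python
# Minimal sketch of a ProtPlat-style pre-train/fine-tune pipeline with the
# fasttext Python package. The 3-mer tokenization, file names, and
# hyperparameters are illustrative assumptions, not values from the paper.
import fasttext


def to_kmers(seq: str, k: int = 3) -> str:
    """Split an amino acid sequence into overlapping k-mer 'words'."""
    return " ".join(seq[i:i + k] for i in range(len(seq) - k + 1))


# Stage 1: supervised pre-training on Pfam family labels. Each line of
# pfam_train.txt (hypothetical file) reads "__label__<family> <k-mers>", e.g.
#   __label__PF00042 MVH VHL HLT ...
pre = fasttext.train_supervised(input="pfam_train.txt", dim=100, epoch=5, lr=0.1)

# Export the learned k-mer embeddings in .vec text format so that the
# downstream classifier can be initialized from them.
with open("protplat.vec", "w") as f:
    f.write(f"{len(pre.words)} {pre.get_dimension()}\n")
    for w in pre.words:
        vec = " ".join(f"{x:.5f}" for x in pre.get_word_vector(w))
        f.write(f"{w} {vec}\n")

# Stage 2: fine-tune on a downstream task (e.g. signal peptide recognition),
# starting from the pre-trained k-mer vectors; dim must match Stage 1.
clf = fasttext.train_supervised(
    input="signal_peptide_train.txt",  # hypothetical, same "__label__" format
    pretrainedVectors="protplat.vec",
    dim=100, epoch=25, lr=0.5,
)

# Classify a new sequence (toy example).
labels, probs = clf.predict(to_kmers("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"))
print(labels, probs)
```

FastText averages the k-mer embeddings into a single sequence vector and feeds it to a linear classifier, which is what keeps both the Pfam pre-training and the downstream fine-tuning fast even with thousands of class labels; whether this corresponds exactly to the paper's three-layer network and training settings should be checked against the original text.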