MiPepid: MicroPeptide identification tool using machine learning

BACKGROUND: Micropeptides are small proteins with length ≤ 100 amino acids. Short open reading frames that could produce micropeptides were traditionally ignored due to technical difficulties, as few small peptides had been experimentally confirmed. In the past decade, a growing number of micr...

Bibliographic Details
Main Authors:	Zhu, Mengmeng; Gribskov, Michael
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2019
Subjects:	Software
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6842143/ https://www.ncbi.nlm.nih.gov/pubmed/31703551 http://dx.doi.org/10.1186/s12859-019-3033-9

_version_	1783467989039841280
author	Zhu, Mengmeng Gribskov, Michael
author_facet	Zhu, Mengmeng Gribskov, Michael
author_sort	Zhu, Mengmeng
collection	PubMed
description	BACKGROUND: Micropeptides are small proteins with length ≤ 100 amino acids. Short open reading frames that could produce micropeptides were traditionally ignored due to technical difficulties, as few small peptides had been experimentally confirmed. In the past decade, a growing number of micropeptides have been shown to play significant roles in vital biological activities. Despite the increased amount of data, we still lack bioinformatics tools for specifically identifying micropeptides from DNA sequences. Indeed, most existing tools for classifying coding and noncoding ORFs were built on datasets in which “normal-sized” proteins were considered to be positives and short ORFs were generally considered to be noncoding. Since the functional and biophysical constraints on small peptides are likely to be different from those on “normal” proteins, methods for predicting short translated ORFs must be trained independently from those for longer proteins. RESULTS: In this study, we have developed MiPepid, a machine-learning tool specifically for the identification of micropeptides. We trained MiPepid using carefully cleaned data from existing databases and used logistic regression with 4-mer features. With only the sequence information of an ORF, MiPepid is able to predict whether it encodes a micropeptide with 96% accuracy on a blind dataset of high-confidence micropeptides, and to correctly classify newly discovered micropeptides not included in either the training or the blind test data. Compared with state-of-the-art coding potential prediction methods, MiPepid performs exceptionally well, as other methods incorrectly classify most bona fide micropeptides as noncoding. MiPepid is alignment-free and runs sufficiently fast for genome-scale analyses. It is easy to use and is available at https://github.com/MindAI/MiPepid. CONCLUSIONS: MiPepid was developed to specifically predict micropeptides, a category of proteins with increasing significance, from DNA sequences. It shows evident advantages over existing coding potential prediction methods on micropeptide identification. It is ready to use and runs fast. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s12859-019-3033-9) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-6842143
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-6842143 2019-11-14 MiPepid: MicroPeptide identification tool using machine learning Zhu, Mengmeng Gribskov, Michael BMC Bioinformatics Software BioMed Central 2019-11-08 /pmc/articles/PMC6842143/ /pubmed/31703551 http://dx.doi.org/10.1186/s12859-019-3033-9 Text en © The Author(s). 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle	Software Zhu, Mengmeng Gribskov, Michael MiPepid: MicroPeptide identification tool using machine learning
title	MiPepid: MicroPeptide identification tool using machine learning
title_full	MiPepid: MicroPeptide identification tool using machine learning
title_fullStr	MiPepid: MicroPeptide identification tool using machine learning
title_full_unstemmed	MiPepid: MicroPeptide identification tool using machine learning
title_short	MiPepid: MicroPeptide identification tool using machine learning
title_sort	mipepid: micropeptide identification tool using machine learning
topic	Software
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6842143/ https://www.ncbi.nlm.nih.gov/pubmed/31703551 http://dx.doi.org/10.1186/s12859-019-3033-9
work_keys_str_mv	AT zhumengmeng mipepidmicropeptideidentificationtoolusingmachinelearning AT gribskovmichael mipepidmicropeptideidentificationtoolusingmachinelearning
