A bacterial phyla dataset for protein function prediction

Protein function prediction is among the most studied and most challenging problems in computational biology. The vast majority of known proteins have not yet been characterised experimentally, and there is a significant gap between their structures and functions. New unannotated sequences...

Bibliographic Details
Main authors: Mishra, Sarthak; Rastogi, Yash Pratap; Jabin, Suraiya; Kaur, Punit; Amir, Mohammad; Khatoon, Shabanam
Format: Online Article Text
Language: English
Published: Elsevier 2019
Subjects: Biochemistry, Genetics and Molecular Biology
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6950771/
https://www.ncbi.nlm.nih.gov/pubmed/31921945
http://dx.doi.org/10.1016/j.dib.2019.105002
_version_ 1783486148597776384
author Mishra, Sarthak
Rastogi, Yash Pratap
Jabin, Suraiya
Kaur, Punit
Amir, Mohammad
Khatoon, Shabanam
author_facet Mishra, Sarthak
Rastogi, Yash Pratap
Jabin, Suraiya
Kaur, Punit
Amir, Mohammad
Khatoon, Shabanam
author_sort Mishra, Sarthak
collection PubMed
description Protein function prediction is among the most studied and most challenging problems in computational biology. The vast majority of known proteins have not yet been characterised experimentally, and there is a significant gap between their structures and functions. New unannotated sequences are being added to public protein databases (e.g. UniProtKB) at an enormous pace [1]. Such proteins of unknown function might play key roles in metabolism, growth and development regulation. Thus, if the functions of unknown proteins are left undiscovered, researchers may miss important information. Based on their sequence, structure, evolutionary history, and association with other proteins, tools of computational biology can provide insights into the function of proteins [2]. For proteins with well-characterised close relatives, it is trivial to infer function. Orphan proteins without discernible sequence relatives present a greater challenge [3]. Here the task of experimental characterisation is blind and becomes unwieldy. It is highly unlikely that all known proteins will ever be completely experimentally characterised [4]. Thus, there is an urgent need to develop fast and accurate computational approaches to fulfil this requirement. Towards this end, we prepared a dataset for protein function prediction by extracting protein sequences and annotations of reviewed prokaryotic proteins (total count 323,719 as accessed on March 10, 2019) belonging to 9 bacterial phyla: Actinobacteria, Bacteroidetes, Chlamydiae, Cyanobacteria, Firmicutes, Fusobacteria, Proteobacteria, Spirochaetes and Tenericutes. Samples were filtered to those annotated with the 1739 most frequent Gene Ontology (Molecular Function) terms, and 171,212 proteins were retrieved for feature generation. The dataset was generated by calculating sequence, sub-sequence, physicochemical and annotation-based features for each of the 171,212 reviewed proteins using the method in [10]. 
These features constitute a total of 9890 attributes for each protein sequence, along with 1739 Gene Ontology terms. Each protein sequence is assigned one or more of the 1739 Gene Ontology (Molecular Function) terms as its target label. The dataset contains the Entry and Entry name of each sequence, corresponding to the UniProtKB database. Being large (171,212 samples × 9890 features, 1739 classes with multiple values) and having a sufficient number of positive and negative samples for each of the 1739 classes, this dataset is well suited to testing the efficiency of upcoming deep learning models [5]. We divided the full dataset of 171,212 reviewed proteins in the ratio 3:1 to form Train/Test dataset 1, with 128,409 samples in the train dataset and 42,803 samples in the test dataset, to facilitate training of a deep learning model. The train and test datasets are stratified to contain a good proportion of each of the 1739 classes. We then prepared dataset 2 of pathogenic unreviewed proteins of the 9 bacterial phyla, each with the same 9890 features as the train/test dataset of reviewed proteins but without target labels, in order to predict their functions using the deep learning model proposed in [5].
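The label preparation and 3:1 split described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names (`top_k_terms`, `filter_and_encode`, `split_3_to_1`), the toy accessions `P1` to `P4` and their annotations are hypothetical, and the split shown is a plain seeded shuffle, whereas the abstract states that the real train/test sets were stratified across the 1739 classes.

```python
from collections import Counter
import random


def top_k_terms(annotations, k):
    """Return the k most frequent GO terms, counting each term once per protein."""
    counts = Counter(term for terms in annotations.values() for term in set(terms))
    return [term for term, _ in counts.most_common(k)]


def filter_and_encode(annotations, kept_terms):
    """Drop proteins with no kept term; encode the rest as multi-hot label rows."""
    index = {term: i for i, term in enumerate(kept_terms)}
    encoded = {}
    for entry, terms in annotations.items():
        row = [0] * len(kept_terms)
        for term in terms:
            if term in index:
                row[index[term]] = 1
        if any(row):  # keep only proteins that carry at least one kept term
            encoded[entry] = row
    return encoded


def split_3_to_1(entries, seed=0):
    """Shuffle entries reproducibly and split them 3:1 (NOT stratified)."""
    entries = sorted(entries)
    random.Random(seed).shuffle(entries)
    n_train = len(entries) * 3 // 4
    return entries[:n_train], entries[n_train:]


# Hypothetical toy annotations: accession -> GO Molecular Function terms.
toy = {
    "P1": ["GO:0005524", "GO:0003677"],
    "P2": ["GO:0005524"],
    "P3": ["GO:0016787"],
    "P4": ["GO:0003677", "GO:0005524"],
}

kept = top_k_terms(toy, k=2)           # the paper keeps k = 1739 terms
labels = filter_and_encode(toy, kept)  # P3 is dropped: none of its terms is kept
print(kept, labels)

# Size check against the figures quoted in the abstract.
train, test = split_3_to_1(range(171_212))
print(len(train), len(test))  # 128409 42803
```

Each protein can carry several terms at once, so the target is a multi-hot row (one column per kept term) rather than a single class index; a 3:1 split of 171,212 samples gives exactly the 128,409 and 42,803 samples quoted above.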
format Online
Article
Text
id pubmed-6950771
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Elsevier
record_format MEDLINE/PubMed
spelling pubmed-6950771 2020-01-09 A bacterial phyla dataset for protein function prediction Mishra, Sarthak Rastogi, Yash Pratap Jabin, Suraiya Kaur, Punit Amir, Mohammad Khatoon, Shabanam Data Brief Biochemistry, Genetics and Molecular Biology Elsevier 2019-12-18 /pmc/articles/PMC6950771/ /pubmed/31921945 http://dx.doi.org/10.1016/j.dib.2019.105002 Text en © 2019 The Author(s) http://creativecommons.org/licenses/by/4.0/ This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
spellingShingle Biochemistry, Genetics and Molecular Biology
Mishra, Sarthak
Rastogi, Yash Pratap
Jabin, Suraiya
Kaur, Punit
Amir, Mohammad
Khatoon, Shabanam
A bacterial phyla dataset for protein function prediction
title A bacterial phyla dataset for protein function prediction
title_full A bacterial phyla dataset for protein function prediction
title_fullStr A bacterial phyla dataset for protein function prediction
title_full_unstemmed A bacterial phyla dataset for protein function prediction
title_short A bacterial phyla dataset for protein function prediction
title_sort bacterial phyla dataset for protein function prediction
topic Biochemistry, Genetics and Molecular Biology
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6950771/
https://www.ncbi.nlm.nih.gov/pubmed/31921945
http://dx.doi.org/10.1016/j.dib.2019.105002