
A top-down approach to classify enzyme functional classes and sub-classes using random forest


Bibliographic Details
Main authors: Kumar, Chetan, Choudhary, Alok
Format: Online Article Text
Language: English
Published: BioMed Central 2012
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3351021/
https://www.ncbi.nlm.nih.gov/pubmed/22376768
http://dx.doi.org/10.1186/1687-4153-2012-1
collection PubMed
description Advancements in sequencing technologies have witnessed an exponential rise in the number of newly found enzymes. Enzymes are proteins that catalyze bio-chemical reactions and play an important role in metabolic pathways. Commonly, function of such enzymes is determined by experiments that can be time consuming and costly. Hence, a need for a computing method is felt that can distinguish protein enzyme sequences from those of non-enzymes and reliably predict the function of the former. To address this problem, approaches that cluster enzymes based on their sequence and structural similarity have been presented. But, these approaches are known to fail for proteins that perform the same function and are dissimilar in their sequence and structure. In this article, we present a supervised machine learning model to predict the function class and sub-class of enzymes based on a set of 73 sequence-derived features. The functional classes are as defined by International Union of Biochemistry and Molecular Biology. Using an efficient data mining algorithm called random forest, we construct a top-down three layer model where the top layer classifies a query protein sequence as an enzyme or non-enzyme, the second layer predicts the main function class and bottom layer further predicts the sub-function class. The model reported overall classification accuracy of 94.87% for the first level, 87.7% for the second, and 84.25% for the bottom level. Our results compare very well with existing methods, and in many cases report better performance. Using feature selection methods, we have shown the biological relevance of a few of the top rank attributes.
format Online
Article
Text
id pubmed-3351021
institution National Center for Biotechnology Information
language English
publishDate 2012
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3351021 2012-05-15 A top-down approach to classify enzyme functional classes and sub-classes using random forest Kumar, Chetan Choudhary, Alok EURASIP J Bioinform Syst Biol Research
BioMed Central 2012 2012-02-29 /pmc/articles/PMC3351021/ /pubmed/22376768 http://dx.doi.org/10.1186/1687-4153-2012-1 Text en Copyright ©2012 Kumar and Choudhary; licensee Springer. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
topic Research