A top-down approach to classify enzyme functional classes and sub-classes using random forest
Advancements in sequencing technologies have witnessed an exponential rise in the number of newly found enzymes. Enzymes are proteins that catalyze bio-chemical reactions and play an important role in metabolic pathways. Commonly, function of such enzymes is determined by experiments that can be tim...
Main Authors: | Kumar, Chetan; Choudhary, Alok |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2012 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3351021/ https://www.ncbi.nlm.nih.gov/pubmed/22376768 http://dx.doi.org/10.1186/1687-4153-2012-1 |
_version_ | 1782232728052170752 |
---|---|
author | Kumar, Chetan Choudhary, Alok |
author_facet | Kumar, Chetan Choudhary, Alok |
author_sort | Kumar, Chetan |
collection | PubMed |
description | Advancements in sequencing technologies have witnessed an exponential rise in the number of newly found enzymes. Enzymes are proteins that catalyze bio-chemical reactions and play an important role in metabolic pathways. Commonly, function of such enzymes is determined by experiments that can be time consuming and costly. Hence, a need for a computing method is felt that can distinguish protein enzyme sequences from those of non-enzymes and reliably predict the function of the former. To address this problem, approaches that cluster enzymes based on their sequence and structural similarity have been presented. But, these approaches are known to fail for proteins that perform the same function and are dissimilar in their sequence and structure. In this article, we present a supervised machine learning model to predict the function class and sub-class of enzymes based on a set of 73 sequence-derived features. The functional classes are as defined by International Union of Biochemistry and Molecular Biology. Using an efficient data mining algorithm called random forest, we construct a top-down three layer model where the top layer classifies a query protein sequence as an enzyme or non-enzyme, the second layer predicts the main function class and bottom layer further predicts the sub-function class. The model reported overall classification accuracy of 94.87% for the first level, 87.7% for the second, and 84.25% for the bottom level. Our results compare very well with existing methods, and in many cases report better performance. Using feature selection methods, we have shown the biological relevance of a few of the top rank attributes. |
format | Online Article Text |
id | pubmed-3351021 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2012 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-33510212012-05-15 A top-down approach to classify enzyme functional classes and sub-classes using random forest Kumar, Chetan Choudhary, Alok EURASIP J Bioinform Syst Biol Research Advancements in sequencing technologies have witnessed an exponential rise in the number of newly found enzymes. Enzymes are proteins that catalyze bio-chemical reactions and play an important role in metabolic pathways. Commonly, function of such enzymes is determined by experiments that can be time consuming and costly. Hence, a need for a computing method is felt that can distinguish protein enzyme sequences from those of non-enzymes and reliably predict the function of the former. To address this problem, approaches that cluster enzymes based on their sequence and structural similarity have been presented. But, these approaches are known to fail for proteins that perform the same function and are dissimilar in their sequence and structure. In this article, we present a supervised machine learning model to predict the function class and sub-class of enzymes based on a set of 73 sequence-derived features. The functional classes are as defined by International Union of Biochemistry and Molecular Biology. Using an efficient data mining algorithm called random forest, we construct a top-down three layer model where the top layer classifies a query protein sequence as an enzyme or non-enzyme, the second layer predicts the main function class and bottom layer further predicts the sub-function class. The model reported overall classification accuracy of 94.87% for the first level, 87.7% for the second, and 84.25% for the bottom level. Our results compare very well with existing methods, and in many cases report better performance. Using feature selection methods, we have shown the biological relevance of a few of the top rank attributes. 
BioMed Central 2012 2012-02-29 /pmc/articles/PMC3351021/ /pubmed/22376768 http://dx.doi.org/10.1186/1687-4153-2012-1 Text en Copyright ©2012 Kumar and Choudhary; licensee Springer. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Research Kumar, Chetan Choudhary, Alok A top-down approach to classify enzyme functional classes and sub-classes using random forest |
title | A top-down approach to classify enzyme functional classes and sub-classes using random forest |
title_full | A top-down approach to classify enzyme functional classes and sub-classes using random forest |
title_fullStr | A top-down approach to classify enzyme functional classes and sub-classes using random forest |
title_full_unstemmed | A top-down approach to classify enzyme functional classes and sub-classes using random forest |
title_short | A top-down approach to classify enzyme functional classes and sub-classes using random forest |
title_sort | top-down approach to classify enzyme functional classes and sub-classes using random forest |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3351021/ https://www.ncbi.nlm.nih.gov/pubmed/22376768 http://dx.doi.org/10.1186/1687-4153-2012-1 |
work_keys_str_mv | AT kumarchetan atopdownapproachtoclassifyenzymefunctionalclassesandsubclassesusingrandomforest AT choudharyalok atopdownapproachtoclassifyenzymefunctionalclassesandsubclassesusingrandomforest AT kumarchetan topdownapproachtoclassifyenzymefunctionalclassesandsubclassesusingrandomforest AT choudharyalok topdownapproachtoclassifyenzymefunctionalclassesandsubclassesusingrandomforest |