
Automatic structure classification of small proteins using random forest

BACKGROUND: Random forest, an ensemble-based supervised machine learning algorithm, is used to predict the SCOP structural classification for a target structure, based on the similarity of its structural descriptors to those of a template structure with an equal number of secondary structure element...


Bibliographic Details
Main authors:	Jain, Pooja, Hirst, Jonathan D
Format:	Text
Language:	English
Published:	BioMed Central 2010
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2916923/ https://www.ncbi.nlm.nih.gov/pubmed/20594334 http://dx.doi.org/10.1186/1471-2105-11-364

_version_	1782185036719587328
author	Jain, Pooja Hirst, Jonathan D
author_facet	Jain, Pooja Hirst, Jonathan D
author_sort	Jain, Pooja
collection	PubMed
description	BACKGROUND: Random forest, an ensemble-based supervised machine learning algorithm, is used to predict the SCOP structural classification for a target structure, based on the similarity of its structural descriptors to those of a template structure with an equal number of secondary structure elements (SSEs). An initial assessment of random forest is carried out for domains consisting of three SSEs. The usability of random forest in classifying larger domains is demonstrated by applying it to domains consisting of four, five and six SSEs. RESULTS: Random forest, trained on SCOP version 1.69, achieves a predictive accuracy of up to 94% on an independent and non-overlapping test set derived from SCOP version 1.73. For classification to the SCOP Class, Fold, Super-family or Family levels, the predictive quality of the model in terms of the Matthews correlation coefficient (MCC) ranged from 0.61 to 0.83. As the number of constituent SSEs increases, the MCC for classification to different structural levels decreases. CONCLUSIONS: The utility of random forest in classifying domains from the place-holder classes of SCOP to the true Class, Fold, Super-family or Family levels is demonstrated. Issues such as the introduction of a new structural level in SCOP and the merger of singleton levels can also be addressed using random forest. A real-world scenario is mimicked by predicting the classification for those protein structures from the PDB that are yet to be assigned to the SCOP classification hierarchy.
format	Text
id	pubmed-2916923
institution	National Center for Biotechnology Information
language	English
publishDate	2010
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-29169232010-08-06 Automatic structure classification of small proteins using random forest Jain, Pooja Hirst, Jonathan D BMC Bioinformatics Research Article BACKGROUND: Random forest, an ensemble-based supervised machine learning algorithm, is used to predict the SCOP structural classification for a target structure, based on the similarity of its structural descriptors to those of a template structure with an equal number of secondary structure elements (SSEs). An initial assessment of random forest is carried out for domains consisting of three SSEs. The usability of random forest in classifying larger domains is demonstrated by applying it to domains consisting of four, five and six SSEs. RESULTS: Random forest, trained on SCOP version 1.69, achieves a predictive accuracy of up to 94% on an independent and non-overlapping test set derived from SCOP version 1.73. For classification to the SCOP Class, Fold, Super-family or Family levels, the predictive quality of the model in terms of the Matthews correlation coefficient (MCC) ranged from 0.61 to 0.83. As the number of constituent SSEs increases, the MCC for classification to different structural levels decreases. CONCLUSIONS: The utility of random forest in classifying domains from the place-holder classes of SCOP to the true Class, Fold, Super-family or Family levels is demonstrated. Issues such as the introduction of a new structural level in SCOP and the merger of singleton levels can also be addressed using random forest. A real-world scenario is mimicked by predicting the classification for those protein structures from the PDB that are yet to be assigned to the SCOP classification hierarchy. BioMed Central 2010-07-01 /pmc/articles/PMC2916923/ /pubmed/20594334 http://dx.doi.org/10.1186/1471-2105-11-364 Text en Copyright ©2010 Jain and Hirst; licensee BioMed Central Ltd. 
http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Research Article Jain, Pooja Hirst, Jonathan D Automatic structure classification of small proteins using random forest
title	Automatic structure classification of small proteins using random forest
title_full	Automatic structure classification of small proteins using random forest
title_fullStr	Automatic structure classification of small proteins using random forest
title_full_unstemmed	Automatic structure classification of small proteins using random forest
title_short	Automatic structure classification of small proteins using random forest
title_sort	automatic structure classification of small proteins using random forest
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2916923/ https://www.ncbi.nlm.nih.gov/pubmed/20594334 http://dx.doi.org/10.1186/1471-2105-11-364
work_keys_str_mv	AT jainpooja automaticstructureclassificationofsmallproteinsusingrandomforest AT hirstjonathand automaticstructureclassificationofsmallproteinsusingrandomforest
