
Improving protein fold recognition by random forest

BACKGROUND: Recognizing the correct structural fold among known template protein structures for a target protein (i.e. fold recognition) is essential for template-based protein structure modeling. Since the fold recognition problem can be defined as a binary classification problem of predicting whet...


Bibliographic Details
Main Authors:	Jo, Taeho; Cheng, Jianlin
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2014
Subjects:	Proceedings
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4251042/ https://www.ncbi.nlm.nih.gov/pubmed/25350499 http://dx.doi.org/10.1186/1471-2105-15-S11-S14

_version_	1782346993110089728
author	Jo, Taeho Cheng, Jianlin
author_facet	Jo, Taeho Cheng, Jianlin
author_sort	Jo, Taeho
collection	PubMed
description	BACKGROUND: Recognizing the correct structural fold among known template protein structures for a target protein (i.e. fold recognition) is essential for template-based protein structure modeling. Since the fold recognition problem can be defined as a binary classification problem of predicting whether or not the unknown fold of a target protein is similar to an already known template protein structure in a library, machine learning methods have been effectively applied to tackle this problem. In our work, we developed RF-Fold that uses random forest - one of the most powerful and scalable machine learning classification methods - to recognize protein folds. RESULTS: RF-Fold consists of hundreds of decision trees that can be trained efficiently on very large datasets to make accurate predictions on a highly imbalanced dataset. We evaluated RF-Fold on the standard Lindahl's benchmark dataset comprised of 976 × 975 target-template protein pairs through cross-validation. Compared with 17 different fold recognition methods, the performance of RF-Fold is generally comparable to the best performance in fold recognition of different difficulty ranging from the easiest family level, the medium-hard superfamily level, and to the hardest fold level. Based on the top-one template protein ranked by RF-Fold, the correct recognition rate is 84.5%, 63.4%, and 40.8% at family, superfamily, and fold levels, respectively. Based on the top-five template protein folds ranked by RF-Fold, the correct recognition rate increases to 91.5%, 79.3% and 58.3% at family, superfamily, and fold levels. CONCLUSIONS: The good performance achieved by the RF-Fold demonstrates the random forest's effectiveness for protein fold recognition.
format	Online Article Text
id	pubmed-4251042
institution	National Center for Biotechnology Information
language	English
publishDate	2014
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4251042 2014-12-02 Improving protein fold recognition by random forest Jo, Taeho Cheng, Jianlin BMC Bioinformatics Proceedings
BioMed Central 2014-10-21 /pmc/articles/PMC4251042/ /pubmed/25350499 http://dx.doi.org/10.1186/1471-2105-15-S11-S14 Text en Copyright © 2014 Jo and Cheng; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle	Proceedings Jo, Taeho Cheng, Jianlin Improving protein fold recognition by random forest
title	Improving protein fold recognition by random forest
title_full	Improving protein fold recognition by random forest
title_fullStr	Improving protein fold recognition by random forest
title_full_unstemmed	Improving protein fold recognition by random forest
title_short	Improving protein fold recognition by random forest
title_sort	improving protein fold recognition by random forest
topic	Proceedings
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4251042/ https://www.ncbi.nlm.nih.gov/pubmed/25350499 http://dx.doi.org/10.1186/1471-2105-15-S11-S14
work_keys_str_mv	AT jotaeho improvingproteinfoldrecognitionbyrandomforest AT chengjianlin improvingproteinfoldrecognitionbyrandomforest
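
The description above reports recognition rates based on the top-one and top-five templates ranked for each target. A minimal sketch of how such a top-k rate could be computed is given below. It is not the authors' evaluation code: the function name, the tuple layout, the toy scores, and the choice to skip targets that have no correct template are all assumptions made for illustration. The score stands in for whatever value the classifier assigns to a target-template pair, for example a random forest's predicted probability.

```python
from collections import defaultdict


def top_k_recognition_rate(scored_pairs, k):
    """Fraction of targets with a correct template among their k best-scored templates.

    scored_pairs: iterable of (target_id, template_id, score, is_match) tuples.
    All names and the tuple layout are hypothetical, chosen for this sketch.
    """
    by_target = defaultdict(list)
    for target_id, _template_id, score, is_match in scored_pairs:
        by_target[target_id].append((score, is_match))

    evaluated = 0
    hits = 0
    for entries in by_target.values():
        # Assumption: a target with no correct template cannot be recognized,
        # so it is left out of the denominator.
        if not any(is_match for _score, is_match in entries):
            continue
        evaluated += 1
        # Rank this target's templates by descending score and keep the best k.
        top_k = sorted(entries, key=lambda e: e[0], reverse=True)[:k]
        if any(is_match for _score, is_match in top_k):
            hits += 1
    return hits / evaluated if evaluated else 0.0


# Toy data (invented): three targets scored against a few templates.
pairs = [
    ("t1", "a", 0.9, True),
    ("t1", "b", 0.4, False),
    ("t2", "a", 0.8, False),
    ("t2", "c", 0.6, False),
    ("t2", "d", 0.5, True),
    ("t3", "b", 0.7, False),  # no correct template, so t3 is skipped
]

top1 = top_k_recognition_rate(pairs, 1)
top3 = top_k_recognition_rate(pairs, 3)
print(f"top-1: {top1:.2f}, top-3: {top3:.2f}")
```

In this toy data the correct template for t1 is ranked first and the one for t2 is ranked third, so the rate rises as k grows, which is the same pattern the description reports between top-one and top-five.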

Similar Items