Improving protein fold recognition by random forest

BACKGROUND: Recognizing the correct structural fold among known template protein structures for a target protein (i.e. fold recognition) is essential for template-based protein structure modeling. Since the fold recognition problem can be defined as a binary classification problem of predicting whet...

Bibliographic Details
Main authors: Jo, Taeho; Cheng, Jianlin
Format: Online Article Text
Language: English
Published: BioMed Central 2014
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4251042/
https://www.ncbi.nlm.nih.gov/pubmed/25350499
http://dx.doi.org/10.1186/1471-2105-15-S11-S14
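
The abstract frames fold recognition as scoring target-template pairs and reports results as top-one and top-five recognition rates. As a rough illustration of how such a rate can be computed from pairwise scores, here is a minimal sketch. It does not reproduce RF-Fold or its features: the function name, the toy scores, and the choice to count every target in the denominator are all assumptions made for this example, and the scores could come from any classifier.

```python
from collections import defaultdict


def top_k_recognition_rate(pair_scores, is_match, k):
    """Fraction of targets with a correct template among their k best-scored ones.

    pair_scores: dict mapping (target, template) -> score (higher = more similar)
    is_match:    set of (target, template) pairs that truly share a fold level
    Both structures are hypothetical stand-ins for a real benchmark.
    """
    by_target = defaultdict(list)
    for (target, template), score in pair_scores.items():
        by_target[target].append((score, template))

    hits = 0
    for target, scored in by_target.items():
        # Rank this target's templates by descending score and keep the k best.
        top = sorted(scored, key=lambda st: st[0], reverse=True)[:k]
        if any((target, template) in is_match for _, template in top):
            hits += 1
    # Assumption: every target counts in the denominator.
    return hits / len(by_target)


# Made-up scores for two targets against three templates each.
scores = {
    ("t1", "a"): 0.9, ("t1", "b"): 0.4, ("t1", "c"): 0.1,
    ("t2", "a"): 0.7, ("t2", "b"): 0.6, ("t2", "c"): 0.2,
}
matches = {("t1", "a"), ("t2", "b")}

top1 = top_k_recognition_rate(scores, matches, k=1)
top2 = top_k_recognition_rate(scores, matches, k=2)
print(top1, top2)
```

In this toy data the correct template for t2 is ranked second, so the rate rises as k grows, which mirrors the gap between the top-one and top-five figures in the abstract.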
author Jo, Taeho
Cheng, Jianlin
collection PubMed
description BACKGROUND: Recognizing the correct structural fold among known template protein structures for a target protein (i.e. fold recognition) is essential for template-based protein structure modeling. Since fold recognition can be framed as a binary classification problem, predicting whether or not the unknown fold of a target protein is similar to a known template protein structure in a library, machine learning methods have been applied effectively to it. In our work, we developed RF-Fold, which uses random forest, one of the most powerful and scalable machine learning classification methods, to recognize protein folds. RESULTS: RF-Fold consists of hundreds of decision trees that can be trained efficiently on very large datasets to make accurate predictions on a highly imbalanced dataset. We evaluated RF-Fold through cross-validation on Lindahl's standard benchmark dataset, which comprises 976 × 975 target-template protein pairs. Compared with 17 different fold recognition methods, the performance of RF-Fold is generally comparable to the best performance at each level of difficulty: the easiest family level, the medium-hard superfamily level, and the hardest fold level. Based on the top-one template protein ranked by RF-Fold, the correct recognition rate is 84.5%, 63.4%, and 40.8% at the family, superfamily, and fold levels, respectively. Based on the top-five template protein folds ranked by RF-Fold, the correct recognition rate increases to 91.5%, 79.3%, and 58.3% at the family, superfamily, and fold levels, respectively. CONCLUSIONS: The good performance achieved by RF-Fold demonstrates the effectiveness of random forest for protein fold recognition.
format Online
Article
Text
id pubmed-4251042
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4251042 2014-12-02 Improving protein fold recognition by random forest Jo, Taeho Cheng, Jianlin BMC Bioinformatics Proceedings
BioMed Central 2014-10-21 /pmc/articles/PMC4251042/ /pubmed/25350499 http://dx.doi.org/10.1186/1471-2105-15-S11-S14 Text en Copyright © 2014 Jo and Cheng; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Improving protein fold recognition by random forest
topic Proceedings