Improving protein fold recognition by random forest
BACKGROUND: Recognizing the correct structural fold among known template protein structures for a target protein (i.e. fold recognition) is essential for template-based protein structure modeling. Since the fold recognition problem can be defined as a binary classification problem of predicting whet...
Main authors: | Jo, Taeho, Cheng, Jianlin |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2014 |
Subjects: | Proceedings |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4251042/ https://www.ncbi.nlm.nih.gov/pubmed/25350499 http://dx.doi.org/10.1186/1471-2105-15-S11-S14 |
_version_ | 1782346993110089728 |
---|---|
author | Jo, Taeho Cheng, Jianlin |
author_facet | Jo, Taeho Cheng, Jianlin |
author_sort | Jo, Taeho |
collection | PubMed |
description | BACKGROUND: Recognizing the correct structural fold among known template protein structures for a target protein (i.e. fold recognition) is essential for template-based protein structure modeling. Since the fold recognition problem can be defined as a binary classification problem of predicting whether or not the unknown fold of a target protein is similar to an already known template protein structure in a library, machine learning methods have been effectively applied to tackle this problem. In our work, we developed RF-Fold that uses random forest - one of the most powerful and scalable machine learning classification methods - to recognize protein folds. RESULTS: RF-Fold consists of hundreds of decision trees that can be trained efficiently on very large datasets to make accurate predictions on a highly imbalanced dataset. We evaluated RF-Fold on the standard Lindahl's benchmark dataset comprised of 976 × 975 target-template protein pairs through cross-validation. Compared with 17 different fold recognition methods, the performance of RF-Fold is generally comparable to the best performance in fold recognition of different difficulty ranging from the easiest family level, the medium-hard superfamily level, and to the hardest fold level. Based on the top-one template protein ranked by RF-Fold, the correct recognition rate is 84.5%, 63.4%, and 40.8% at family, superfamily, and fold levels, respectively. Based on the top-five template protein folds ranked by RF-Fold, the correct recognition rate increases to 91.5%, 79.3% and 58.3% at family, superfamily, and fold levels. CONCLUSIONS: The good performance achieved by the RF-Fold demonstrates the random forest's effectiveness for protein fold recognition. |
format | Online Article Text |
id | pubmed-4251042 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2014 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-42510422014-12-02 Improving protein fold recognition by random forest Jo, Taeho Cheng, Jianlin BMC Bioinformatics Proceedings BACKGROUND: Recognizing the correct structural fold among known template protein structures for a target protein (i.e. fold recognition) is essential for template-based protein structure modeling. Since the fold recognition problem can be defined as a binary classification problem of predicting whether or not the unknown fold of a target protein is similar to an already known template protein structure in a library, machine learning methods have been effectively applied to tackle this problem. In our work, we developed RF-Fold that uses random forest - one of the most powerful and scalable machine learning classification methods - to recognize protein folds. RESULTS: RF-Fold consists of hundreds of decision trees that can be trained efficiently on very large datasets to make accurate predictions on a highly imbalanced dataset. We evaluated RF-Fold on the standard Lindahl's benchmark dataset comprised of 976 × 975 target-template protein pairs through cross-validation. Compared with 17 different fold recognition methods, the performance of RF-Fold is generally comparable to the best performance in fold recognition of different difficulty ranging from the easiest family level, the medium-hard superfamily level, and to the hardest fold level. Based on the top-one template protein ranked by RF-Fold, the correct recognition rate is 84.5%, 63.4%, and 40.8% at family, superfamily, and fold levels, respectively. Based on the top-five template protein folds ranked by RF-Fold, the correct recognition rate increases to 91.5%, 79.3% and 58.3% at family, superfamily, and fold levels. CONCLUSIONS: The good performance achieved by the RF-Fold demonstrates the random forest's effectiveness for protein fold recognition. 
BioMed Central 2014-10-21 /pmc/articles/PMC4251042/ /pubmed/25350499 http://dx.doi.org/10.1186/1471-2105-15-S11-S14 Text en Copyright © 2014 Jo and Cheng; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Proceedings Jo, Taeho Cheng, Jianlin Improving protein fold recognition by random forest |
title | Improving protein fold recognition by random forest |
title_full | Improving protein fold recognition by random forest |
title_fullStr | Improving protein fold recognition by random forest |
title_full_unstemmed | Improving protein fold recognition by random forest |
title_short | Improving protein fold recognition by random forest |
title_sort | improving protein fold recognition by random forest |
topic | Proceedings |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4251042/ https://www.ncbi.nlm.nih.gov/pubmed/25350499 http://dx.doi.org/10.1186/1471-2105-15-S11-S14 |
work_keys_str_mv | AT jotaeho improvingproteinfoldrecognitionbyrandomforest AT chengjianlin improvingproteinfoldrecognitionbyrandomforest |
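
The description reports recognition rates based on the top-one and top-five templates ranked for each target. The following is a minimal sketch of how such a top-k rate could be computed from per-pair scores. It is not the RF-Fold implementation: the function name `top_k_recognition_rate`, the toy scores, and the protein labels are hypothetical, and the choice to skip targets that have no correct template in the library is an assumption about the evaluation protocol. The scores stand in for whatever a trained classifier (for example, a random forest) would assign to each target-template pair.

```python
from typing import Dict, List, Set, Tuple

# Pair count stated in the description: 976 targets x 975 templates each.
N_PAIRS = 976 * 975  # 951600


def top_k_recognition_rate(
    scores: Dict[str, List[Tuple[str, float]]],
    correct_pairs: Set[Tuple[str, str]],
    k: int,
) -> float:
    """Fraction of targets with a correct template among the k best-scored.

    scores: for each target, a list of (template, score) pairs, where a
        higher score means the classifier considers the pair more similar.
    correct_pairs: (target, template) pairs that share the same class at
        the level being evaluated (family, superfamily, or fold).
    Assumption: targets with no correct template at all are not counted.
    """
    hits = 0
    evaluated = 0
    for target, scored in scores.items():
        if not any((target, tmpl) in correct_pairs for tmpl, _ in scored):
            continue  # no correct answer exists for this target
        evaluated += 1
        # Rank templates from highest to lowest score and keep the first k.
        top = sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
        if any((target, tmpl) in correct_pairs for tmpl, _ in top):
            hits += 1
    return hits / evaluated if evaluated else 0.0


# Hypothetical toy data: two targets scored against three templates.
toy_scores = {
    "t1": [("a", 0.9), ("b", 0.2), ("c", 0.1)],
    "t2": [("a", 0.8), ("b", 0.3), ("c", 0.7)],
}
toy_correct = {("t1", "a"), ("t2", "b")}

top1 = top_k_recognition_rate(toy_scores, toy_correct, k=1)
top3 = top_k_recognition_rate(toy_scores, toy_correct, k=3)
print(N_PAIRS, top1, top3)
```

In the toy data, target `t1` has its correct template ranked first, while `t2` has it ranked third, so widening k from 1 to 3 raises the rate. This mirrors why the reported top-five rates are higher than the top-one rates.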