Application of an interpretable classification model on Early Folding Residues during protein folding
BACKGROUND: Machine learning strategies are prominent tools for data analysis. Especially in life sciences, they have become increasingly important to handle the growing datasets collected by the scientific community. Meanwhile, algorithms improve in performance, but also gain complexity, and tend t...
Main Authors: | Bittrich, Sebastian; Kaden, Marika; Leberecht, Christoph; Kaiser, Florian; Villmann, Thomas; Labudde, Dirk |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2019 |
Subjects: | Methodology |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6321665/ https://www.ncbi.nlm.nih.gov/pubmed/30627219 http://dx.doi.org/10.1186/s13040-018-0188-2 |
_version_ | 1783385495868276736 |
---|---|
author | Bittrich, Sebastian Kaden, Marika Leberecht, Christoph Kaiser, Florian Villmann, Thomas Labudde, Dirk |
author_facet | Bittrich, Sebastian Kaden, Marika Leberecht, Christoph Kaiser, Florian Villmann, Thomas Labudde, Dirk |
author_sort | Bittrich, Sebastian |
collection | PubMed |
description | BACKGROUND: Machine learning strategies are prominent tools for data analysis. Especially in life sciences, they have become increasingly important to handle the growing datasets collected by the scientific community. Meanwhile, algorithms improve in performance, but also gain complexity, and tend to neglect interpretability and comprehensiveness of the resulting models. RESULTS: Generalized Matrix Learning Vector Quantization (GMLVQ) is a supervised, prototype-based machine learning method and provides comprehensive visualization capabilities not present in other classifiers which allow for a fine-grained interpretation of the data. In contrast to commonly used machine learning strategies, GMLVQ is well-suited for imbalanced classification problems which are frequent in life sciences. We present a Weka plug-in implementing GMLVQ. The feasibility of GMLVQ is demonstrated on a dataset of Early Folding Residues (EFR) that have been shown to initiate and guide the protein folding process. Using 27 features, an area under the receiver operating characteristic of 76.6% was achieved which is comparable to other state-of-the-art classifiers. The obtained model is accessible at https://biosciences.hs-mittweida.de/efpred/. CONCLUSIONS: The application on EFR prediction demonstrates how an easy interpretation of classification models can promote the comprehension of biological mechanisms. The results shed light on the special features of EFR which were reported as most influential for the classification: EFR are embedded in ordered secondary structure elements and they participate in networks of hydrophobic residues. Visualization capabilities of GMLVQ are presented as we demonstrate how to interpret the results. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s13040-018-0188-2) contains supplementary material, which is available to authorized users. |
format | Online Article Text |
id | pubmed-6321665 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2019 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-6321665 2019-01-09 Application of an interpretable classification model on Early Folding Residues during protein folding Bittrich, Sebastian Kaden, Marika Leberecht, Christoph Kaiser, Florian Villmann, Thomas Labudde, Dirk BioData Min Methodology BACKGROUND: Machine learning strategies are prominent tools for data analysis. Especially in life sciences, they have become increasingly important to handle the growing datasets collected by the scientific community. Meanwhile, algorithms improve in performance, but also gain complexity, and tend to neglect interpretability and comprehensiveness of the resulting models. RESULTS: Generalized Matrix Learning Vector Quantization (GMLVQ) is a supervised, prototype-based machine learning method and provides comprehensive visualization capabilities not present in other classifiers which allow for a fine-grained interpretation of the data. In contrast to commonly used machine learning strategies, GMLVQ is well-suited for imbalanced classification problems which are frequent in life sciences. We present a Weka plug-in implementing GMLVQ. The feasibility of GMLVQ is demonstrated on a dataset of Early Folding Residues (EFR) that have been shown to initiate and guide the protein folding process. Using 27 features, an area under the receiver operating characteristic of 76.6% was achieved which is comparable to other state-of-the-art classifiers. The obtained model is accessible at https://biosciences.hs-mittweida.de/efpred/. CONCLUSIONS: The application on EFR prediction demonstrates how an easy interpretation of classification models can promote the comprehension of biological mechanisms. The results shed light on the special features of EFR which were reported as most influential for the classification: EFR are embedded in ordered secondary structure elements and they participate in networks of hydrophobic residues. Visualization capabilities of GMLVQ are presented as we demonstrate how to interpret the results. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s13040-018-0188-2) contains supplementary material, which is available to authorized users. BioMed Central 2019-01-05 /pmc/articles/PMC6321665/ /pubmed/30627219 http://dx.doi.org/10.1186/s13040-018-0188-2 Text en © The Author(s) 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Methodology Bittrich, Sebastian Kaden, Marika Leberecht, Christoph Kaiser, Florian Villmann, Thomas Labudde, Dirk Application of an interpretable classification model on Early Folding Residues during protein folding |
title | Application of an interpretable classification model on Early Folding Residues during protein folding |
title_full | Application of an interpretable classification model on Early Folding Residues during protein folding |
title_fullStr | Application of an interpretable classification model on Early Folding Residues during protein folding |
title_full_unstemmed | Application of an interpretable classification model on Early Folding Residues during protein folding |
title_short | Application of an interpretable classification model on Early Folding Residues during protein folding |
title_sort | application of an interpretable classification model on early folding residues during protein folding |
topic | Methodology |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6321665/ https://www.ncbi.nlm.nih.gov/pubmed/30627219 http://dx.doi.org/10.1186/s13040-018-0188-2 |
work_keys_str_mv | AT bittrichsebastian applicationofaninterpretableclassificationmodelonearlyfoldingresiduesduringproteinfolding AT kadenmarika applicationofaninterpretableclassificationmodelonearlyfoldingresiduesduringproteinfolding AT leberechtchristoph applicationofaninterpretableclassificationmodelonearlyfoldingresiduesduringproteinfolding AT kaiserflorian applicationofaninterpretableclassificationmodelonearlyfoldingresiduesduringproteinfolding AT villmannthomas applicationofaninterpretableclassificationmodelonearlyfoldingresiduesduringproteinfolding AT labuddedirk applicationofaninterpretableclassificationmodelonearlyfoldingresiduesduringproteinfolding |
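
The description field above characterizes GMLVQ as a supervised, prototype-based method whose model can be interpreted feature by feature. The following is a minimal sketch of that idea, not the Weka plug-in's API and not the authors' trained model: a sample is assigned the label of the nearest prototype under the adaptive squared distance d(x, w) = ||Ω(x − w)||², and the diagonal of Λ = ΩᵀΩ is read as per-feature relevance. All names (`adaptive_distance`, `classify`, `feature_relevance`), the two toy features, the prototypes, and the matrix `omega` are hypothetical values chosen for illustration; training Ω and the prototypes is omitted.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def adaptive_distance(x: Vector, w: Vector, omega: Matrix) -> float:
    """Squared distance ||omega (x - w)||^2 between sample x and prototype w."""
    diff = [xi - wi for xi, wi in zip(x, w)]
    projected = [sum(o * d for o, d in zip(row, diff)) for row in omega]
    return sum(p * p for p in projected)


def classify(x: Vector, prototypes: List[Tuple[Vector, str]], omega: Matrix) -> str:
    """Return the label of the prototype closest to x under the adaptive distance."""
    nearest = min(prototypes, key=lambda p: adaptive_distance(x, p[0], omega))
    return nearest[1]


def feature_relevance(omega: Matrix) -> List[float]:
    """Diagonal of Lambda = omega^T omega: one relevance value per feature."""
    n_features = len(omega[0])
    return [sum(row[j] ** 2 for row in omega) for j in range(n_features)]


# Hypothetical toy model with two features and one prototype per class.
prototypes = [([0.0, 0.0], "non-EFR"), ([1.0, 1.0], "EFR")]
# Assumed (not learned) matrix that weights feature 0 far more than feature 1.
omega = [[1.0, 0.0], [0.0, 0.1]]
identity = [[1.0, 0.0], [0.0, 1.0]]

sample = [0.9, 0.0]
print(classify(sample, prototypes, omega))     # label under the weighted distance
print(classify(sample, prototypes, identity))  # label under plain Euclidean distance
print(feature_relevance(omega))                # relevance of each feature
```

In this toy setup the same sample receives a different label once feature 1 is down-weighted, which is the kind of effect the relevance values make visible when interpreting a model.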