Application of an interpretable classification model on Early Folding Residues during protein folding

BACKGROUND: Machine learning strategies are prominent tools for data analysis. Especially in life sciences, they have become increasingly important to handle the growing datasets collected by the scientific community. Meanwhile, algorithms improve in performance, but also gain complexity, and tend t...

Bibliographic Details
Main authors: Bittrich, Sebastian, Kaden, Marika, Leberecht, Christoph, Kaiser, Florian, Villmann, Thomas, Labudde, Dirk
Format: Online Article Text
Language: English
Published: BioMed Central 2019
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6321665/
https://www.ncbi.nlm.nih.gov/pubmed/30627219
http://dx.doi.org/10.1186/s13040-018-0188-2
author Bittrich, Sebastian
Kaden, Marika
Leberecht, Christoph
Kaiser, Florian
Villmann, Thomas
Labudde, Dirk
collection PubMed
description BACKGROUND: Machine learning strategies are prominent tools for data analysis. Especially in life sciences, they have become increasingly important to handle the growing datasets collected by the scientific community. Meanwhile, algorithms improve in performance, but also gain complexity, and tend to neglect interpretability and comprehensiveness of the resulting models. RESULTS: Generalized Matrix Learning Vector Quantization (GMLVQ) is a supervised, prototype-based machine learning method and provides comprehensive visualization capabilities not present in other classifiers which allow for a fine-grained interpretation of the data. In contrast to commonly used machine learning strategies, GMLVQ is well-suited for imbalanced classification problems which are frequent in life sciences. We present a Weka plug-in implementing GMLVQ. The feasibility of GMLVQ is demonstrated on a dataset of Early Folding Residues (EFR) that have been shown to initiate and guide the protein folding process. Using 27 features, an area under the receiver operating characteristic of 76.6% was achieved which is comparable to other state-of-the-art classifiers. The obtained model is accessible at https://biosciences.hs-mittweida.de/efpred/. CONCLUSIONS: The application on EFR prediction demonstrates how an easy interpretation of classification models can promote the comprehension of biological mechanisms. The results shed light on the special features of EFR which were reported as most influential for the classification: EFR are embedded in ordered secondary structure elements and they participate in networks of hydrophobic residues. Visualization capabilities of GMLVQ are presented as we demonstrate how to interpret the results. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s13040-018-0188-2) contains supplementary material, which is available to authorized users.
format Online
Article
Text
id pubmed-6321665
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-63216652019-01-09 Application of an interpretable classification model on Early Folding Residues during protein folding Bittrich, Sebastian Kaden, Marika Leberecht, Christoph Kaiser, Florian Villmann, Thomas Labudde, Dirk BioData Min Methodology BACKGROUND: Machine learning strategies are prominent tools for data analysis. Especially in life sciences, they have become increasingly important to handle the growing datasets collected by the scientific community. Meanwhile, algorithms improve in performance, but also gain complexity, and tend to neglect interpretability and comprehensiveness of the resulting models. RESULTS: Generalized Matrix Learning Vector Quantization (GMLVQ) is a supervised, prototype-based machine learning method and provides comprehensive visualization capabilities not present in other classifiers which allow for a fine-grained interpretation of the data. In contrast to commonly used machine learning strategies, GMLVQ is well-suited for imbalanced classification problems which are frequent in life sciences. We present a Weka plug-in implementing GMLVQ. The feasibility of GMLVQ is demonstrated on a dataset of Early Folding Residues (EFR) that have been shown to initiate and guide the protein folding process. Using 27 features, an area under the receiver operating characteristic of 76.6% was achieved which is comparable to other state-of-the-art classifiers. The obtained model is accessible at https://biosciences.hs-mittweida.de/efpred/. CONCLUSIONS: The application on EFR prediction demonstrates how an easy interpretation of classification models can promote the comprehension of biological mechanisms. The results shed light on the special features of EFR which were reported as most influential for the classification: EFR are embedded in ordered secondary structure elements and they participate in networks of hydrophobic residues. Visualization capabilities of GMLVQ are presented as we demonstrate how to interpret the results. 
ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (10.1186/s13040-018-0188-2) contains supplementary material, which is available to authorized users. BioMed Central 2019-01-05 /pmc/articles/PMC6321665/ /pubmed/30627219 http://dx.doi.org/10.1186/s13040-018-0188-2 Text en © The Author(s) 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Application of an interpretable classification model on Early Folding Residues during protein folding
topic Methodology