An Efficient Data Partitioning to Improve Classification Performance While Keeping Parameters Interpretable
Supervised machine learning methods typically require splitting data into multiple chunks for training, validating, and finally testing classifiers. For finding the best parameters of a classifier, training and validation are usually carried out with cross-validation. This is followed by application...
Main Authors: | Korjus, Kristjan; Hebart, Martin N.; Vicente, Raul |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Public Library of Science, 2016 |
Subjects: | Research Article |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5001642/ https://www.ncbi.nlm.nih.gov/pubmed/27564393 http://dx.doi.org/10.1371/journal.pone.0161788 |
_version_ | 1782450457734545408 |
---|---|
author | Korjus, Kristjan; Hebart, Martin N.; Vicente, Raul
author_facet | Korjus, Kristjan; Hebart, Martin N.; Vicente, Raul
author_sort | Korjus, Kristjan |
collection | PubMed |
description | Supervised machine learning methods typically require splitting data into multiple chunks for training, validating, and finally testing classifiers. For finding the best parameters of a classifier, training and validation are usually carried out with cross-validation. This is followed by application of the classifier with optimized parameters to a separate test set for estimating the classifier’s generalization performance. With limited data, this separation of test data creates a difficult trade-off between having more statistical power in estimating generalization performance versus choosing better parameters and fitting a better model. We propose a novel approach that we term “cross-validation and cross-testing,” which improves this trade-off by re-using test data without biasing classifier performance. The novel approach is validated using simulated data and electrophysiological recordings in humans and rodents. The results demonstrate that the approach has a higher probability of discovering significant results than the standard approach of cross-validation and testing, while maintaining the nominal alpha level. In contrast to nested cross-validation, which is maximally efficient in re-using data, the proposed approach additionally maintains the interpretability of individual parameters. Taken together, we suggest an addition to currently used machine learning approaches which may be particularly useful in cases where model weights do not require interpretation, but parameters do. |
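For readers who want a concrete picture of the partitioning, below is a minimal sketch of one plausible reading of the “cross-validation and cross-testing” scheme summarized above, written in Python with scikit-learn. It is an illustration, not the authors’ published code: the synthetic dataset, the SVC classifier, the parameter grid, and the fold counts are all assumptions made for the example.

```python
# Hypothetical sketch of "cross-validation and cross-testing" as summarized
# in the abstract; the exact fold bookkeeping in the paper may differ.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Step 1: hold out a test set, as in the standard cross-validation-and-testing scheme.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Step 2: choose the classifier's parameters by cross-validation on the
# development (training + validation) data only; these stay interpretable.
grid = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=5)
grid.fit(X_dev, y_dev)
best_params = grid.best_params_  # frozen from here on

# Step 3 ("cross-testing"): with parameters fixed, test data can be re-used
# for fitting model weights. Each test fold is scored by a model trained on
# the development data plus the *other* test folds.
accs = []
for fit_idx, eval_idx in StratifiedKFold(n_splits=5).split(X_test, y_test):
    clf = SVC(**best_params).fit(
        np.vstack([X_dev, X_test[fit_idx]]),
        np.concatenate([y_dev, y_test[fit_idx]]))
    accs.append(clf.score(X_test[eval_idx], y_test[eval_idx]))

print(f"chosen parameters: {best_params}")
print(f"cross-tested accuracy: {np.mean(accs):.3f}")
```

The key point of the design, as the abstract describes it, is that the parameters are fixed before any test sample is used for training, so re-using test folds to fit model weights enlarges the effective training set (more statistical power) without biasing the generalization estimate.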
format | Online Article Text |
id | pubmed-5001642 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2016 |
publisher | Public Library of Science |
record_format | MEDLINE/PubMed |
spelling | pubmed-5001642 2016-09-12 An Efficient Data Partitioning to Improve Classification Performance While Keeping Parameters Interpretable Korjus, Kristjan; Hebart, Martin N.; Vicente, Raul. PLoS One, Research Article. Public Library of Science 2016-08-26 /pmc/articles/PMC5001642/ /pubmed/27564393 http://dx.doi.org/10.1371/journal.pone.0161788 Text en © 2016 Korjus et al. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. |
spellingShingle | Research Article; Korjus, Kristjan; Hebart, Martin N.; Vicente, Raul; An Efficient Data Partitioning to Improve Classification Performance While Keeping Parameters Interpretable
title | An Efficient Data Partitioning to Improve Classification Performance While Keeping Parameters Interpretable |
title_full | An Efficient Data Partitioning to Improve Classification Performance While Keeping Parameters Interpretable |
title_fullStr | An Efficient Data Partitioning to Improve Classification Performance While Keeping Parameters Interpretable |
title_full_unstemmed | An Efficient Data Partitioning to Improve Classification Performance While Keeping Parameters Interpretable |
title_short | An Efficient Data Partitioning to Improve Classification Performance While Keeping Parameters Interpretable |
title_sort | efficient data partitioning to improve classification performance while keeping parameters interpretable |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5001642/ https://www.ncbi.nlm.nih.gov/pubmed/27564393 http://dx.doi.org/10.1371/journal.pone.0161788 |
work_keys_str_mv | AT korjuskristjan anefficientdatapartitioningtoimproveclassificationperformancewhilekeepingparametersinterpretable AT hebartmartinn anefficientdatapartitioningtoimproveclassificationperformancewhilekeepingparametersinterpretable AT vicenteraul anefficientdatapartitioningtoimproveclassificationperformancewhilekeepingparametersinterpretable AT korjuskristjan efficientdatapartitioningtoimproveclassificationperformancewhilekeepingparametersinterpretable AT hebartmartinn efficientdatapartitioningtoimproveclassificationperformancewhilekeepingparametersinterpretable AT vicenteraul efficientdatapartitioningtoimproveclassificationperformancewhilekeepingparametersinterpretable |