An Efficient Data Partitioning to Improve Classification Performance While Keeping Parameters Interpretable

Bibliographic Details
Main Authors: Korjus, Kristjan; Hebart, Martin N.; Vicente, Raul
Format: Online Article Text
Language: English
Published: Public Library of Science, 2016
Subjects: Research Article
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5001642/
https://www.ncbi.nlm.nih.gov/pubmed/27564393
http://dx.doi.org/10.1371/journal.pone.0161788

Description
Supervised machine learning methods typically require splitting data into multiple chunks for training, validating, and finally testing classifiers. For finding the best parameters of a classifier, training and validation are usually carried out with cross-validation. This is followed by application of the classifier with optimized parameters to a separate test set for estimating the classifier’s generalization performance. With limited data, this separation of test data creates a difficult trade-off between having more statistical power in estimating generalization performance and choosing better parameters to fit a better model. We propose a novel approach, termed “Cross-validation and cross-testing,” which improves this trade-off by re-using test data without biasing classifier performance. The novel approach is validated using simulated data and electrophysiological recordings in humans and rodents. The results demonstrate that the approach has a higher probability of discovering significant results than the standard approach of cross-validation and testing, while maintaining the nominal alpha level. In contrast to nested cross-validation, which is maximally efficient in re-using data, the proposed approach additionally maintains the interpretability of individual parameters. Taken together, we suggest an addition to currently used machine learning approaches, which may be particularly useful in cases where model weights do not require interpretation, but parameters do.
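
The trade-off described in the abstract is easiest to see in code. The sketch below illustrates the standard “cross-validation and testing” baseline that the paper improves on: cross-validation over a training portion selects an interpretable parameter, and a held-out test set, untouched during selection, yields the unbiased performance estimate. This is only a minimal illustration of that baseline under assumed choices (not the paper’s proposed cross-testing procedure); the simulated data, linear SVM, and parameter grid are arbitrary stand-ins.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Simulated data standing in for a real (e.g., electrophysiological) dataset.
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Standard scheme: reserve a test set that is never touched during parameter
# selection. With limited data, every sample reserved here is lost to model
# fitting and validation -- the trade-off the paper addresses.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Cross-validation on the training portion selects the regularization
# parameter C. Because C is chosen explicitly, it remains interpretable.
search = GridSearchCV(SVC(kernel="linear"), {"C": [0.01, 0.1, 1.0, 10.0]}, cv=5)
search.fit(X_train, y_train)

# The classifier with the optimized parameter is applied once to the held-out
# test set, giving an unbiased estimate of generalization performance.
print("selected C:", search.best_params_["C"])
print("test accuracy:", search.best_estimator_.score(X_test, y_test))
```

The proposed “Cross-validation and cross-testing” scheme, per the abstract, additionally re-uses the samples reserved for testing without biasing this estimate; the exact procedure is given in the full text linked above.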

Record Information
Collection: PubMed
Institution: National Center for Biotechnology Information
Record ID: pubmed-5001642
Record Format: MEDLINE/PubMed
Journal: PLoS One (Research Article)
Published Online: 2016-08-26

License
© 2016 Korjus et al. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.