
Machine learning algorithm validation with a limited sample size

Bibliographic Details
Main Authors: Vabalas, Andrius; Gowen, Emma; Poliakoff, Ellen; Casson, Alexander J.
Format: Online Article Text
Language: English
Published: Public Library of Science, 2019
Subjects: Research Article
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6837442/
https://www.ncbi.nlm.nih.gov/pubmed/31697686
http://dx.doi.org/10.1371/journal.pone.0224365

Record Details
collection PubMed
description Advances in neuroimaging, genomics, motion tracking, eye tracking and many other technology-based data collection methods have led to a torrent of high-dimensional datasets, which commonly have a small number of samples because of the intrinsically high cost of data collection involving human participants. High-dimensional data with a small number of samples are of critical importance for identifying biomarkers and conducting feasibility and pilot work; however, they can lead to biased machine learning (ML) performance estimates. Our review of studies which applied ML to distinguish autistic from non-autistic individuals showed that small sample size is associated with higher reported classification accuracy. We therefore investigated whether this bias could be caused by the use of validation methods which do not sufficiently control overfitting. Our simulations show that K-fold Cross-Validation (CV) produces strongly biased performance estimates with small sample sizes, and the bias is still evident with a sample size of 1000. Nested CV and train/test split approaches produce robust and unbiased performance estimates regardless of sample size. We also show that feature selection, if performed on pooled training and testing data, contributes considerably more to bias than parameter tuning. In addition, the contributions of data dimensionality, hyper-parameter space and number of CV folds to bias were explored, and the validation methods were compared on discriminable data. The results suggest how to design robust testing methodologies when working with small datasets and how to interpret the results of other studies based on which validation method was used. (A minimal code sketch illustrating these validation effects follows the record fields below.)
id pubmed-6837442
institution National Center for Biotechnology Information
record_format MEDLINE/PubMed
journal PLoS One (Research Article)
published 2019-11-07
license © 2019 Vabalas et al. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
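
As referenced in the description above, the feature-selection leakage and validation effects the paper investigates are easy to demonstrate. The sketch below is not the authors' simulation code; it is a minimal Python/scikit-learn illustration under assumed choices (a 40-sample pure-noise dataset with 1000 features, SelectKBest univariate selection, a linear SVM), picked so that the true accuracy of any classifier is chance, i.e. 50%. Any estimate above that level is validation bias, not real performance.

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Pure-noise data: 40 samples, 1000 features, two balanced classes,
# no real signal, so true accuracy is 50% by construction.
rng = np.random.default_rng(0)
X = rng.standard_normal((40, 1000))
y = np.repeat([0, 1], 20)

outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Biased protocol: features selected on the pooled data (training and
# testing folds together) before K-fold CV, so the test folds have
# already leaked into the model.
X_leaky = SelectKBest(f_classif, k=10).fit_transform(X, y)
biased = cross_val_score(SVC(kernel="linear"), X_leaky, y, cv=outer)

# Unbiased protocol: selection refitted inside each training fold only,
# which is what a proper train/test split enforces.
pipe = make_pipeline(SelectKBest(f_classif, k=10), SVC(kernel="linear"))
unbiased = cross_val_score(pipe, X, y, cv=outer)

# Nested CV: hyper-parameter tuning confined to an inner loop, so the
# outer test folds never influence selection or tuning.
inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=1)
tuned = GridSearchCV(pipe, {"svc__C": [0.1, 1.0, 10.0]}, cv=inner)
nested = cross_val_score(tuned, X, y, cv=outer)

print(f"pooled feature selection + K-fold CV: {biased.mean():.2f}")
print(f"selection inside training folds:      {unbiased.mean():.2f}")
print(f"nested CV with inner-loop tuning:     {nested.mean():.2f}")

On runs like this the first estimate typically lands well above chance, while the latter two stay near 0.50, mirroring the bias pattern the description reports for small samples.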