Radiomics machine learning study with a small sample size: Single random training-test set split may lead to unreliable results

This study aims to determine how randomly splitting a dataset into training and test sets affects the estimated performance of a machine learning model and its gap from the test performance under different conditions, using real-world brain tumor radiomics data. We conducted two classification tasks...
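
The effect described above, where training and test performance depend on which random split happens to be drawn, can be illustrated with a minimal sketch. It is not the authors' pipeline. Everything here is an assumption made for illustration: the synthetic Gaussian features (`make_data`), the toy mean-difference scorer used in place of the LASSO model, the stratified 70/30 split, the 200 repeats (the study reports 1,000), and the use of the absolute train-test AUC difference as the gap. Only the class sizes (109 vs. 58) are taken from the record's "simple" task.

```python
import random
import statistics


def auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def make_data(n_pos, n_neg, n_features, shift, rng):
    """Hypothetical stand-in data: Gaussian noise, with feature 0 shifted for class 1."""
    X, y = [], []
    for label, count in ((1, n_pos), (0, n_neg)):
        for _ in range(count):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
            row[0] += shift * label
            X.append(row)
            y.append(label)
    return X, y


def fit_mean_difference(X, y):
    """Toy linear scorer (not LASSO): weight j is the class-mean difference of feature j."""
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == 0]
    return [
        sum(r[j] for r in pos) / len(pos) - sum(r[j] for r in neg) / len(neg)
        for j in range(len(X[0]))
    ]


def predict_scores(weights, X):
    """Linear score for each row: dot product of the weights with the features."""
    return [sum(w * v for w, v in zip(weights, x)) for x in X]


def repeated_split_aucs(X, y, n_repeats, test_fraction, rng):
    """Return one (train_auc, test_auc) pair per stratified random split."""
    pos_idx = [i for i, t in enumerate(y) if t == 1]
    neg_idx = [i for i, t in enumerate(y) if t == 0]
    results = []
    for _ in range(n_repeats):
        train, test = [], []
        # Split each class separately so both sets always contain both classes.
        for group in (pos_idx, neg_idx):
            shuffled = group[:]
            rng.shuffle(shuffled)
            n_test = max(1, round(test_fraction * len(shuffled)))
            test += shuffled[:n_test]
            train += shuffled[n_test:]
        X_train, y_train = [X[i] for i in train], [y[i] for i in train]
        X_test, y_test = [X[i] for i in test], [y[i] for i in test]
        w = fit_mean_difference(X_train, y_train)
        train_auc = auc(predict_scores(w, X_train), y_train)
        test_auc = auc(predict_scores(w, X_test), y_test)
        results.append((train_auc, test_auc))
    return results


if __name__ == "__main__":
    rng = random.Random(0)
    X, y = make_data(n_pos=109, n_neg=58, n_features=20, shift=1.0, rng=rng)
    pairs = repeated_split_aucs(X, y, n_repeats=200, test_fraction=0.3, rng=rng)
    gaps = [abs(tr - te) for tr, te in pairs]
    test_aucs = [te for _, te in pairs]
    print(f"test AUC range: {min(test_aucs):.3f} to {max(test_aucs):.3f}")
    print(f"mean |train - test| gap: {statistics.mean(gaps):.3f} (SD {statistics.stdev(gaps):.3f})")
```

Running the script prints the spread of test AUCs and the mean gap across splits. Shrinking `n_pos` and `n_neg` should widen both, which is the qualitative pattern the study reports for its undersampled datasets.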

Bibliographic Details
Main authors:	An, Chansik; Park, Yae Won; Ahn, Sung Soo; Han, Kyunghwa; Kim, Hwiyoung; Lee, Seung-Koo
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2021
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8360533/ https://www.ncbi.nlm.nih.gov/pubmed/34383858 http://dx.doi.org/10.1371/journal.pone.0256152

author	An, Chansik; Park, Yae Won; Ahn, Sung Soo; Han, Kyunghwa; Kim, Hwiyoung; Lee, Seung-Koo
collection	PubMed
description	This study aims to determine how randomly splitting a dataset into training and test sets affects the estimated performance of a machine learning model and its gap from the test performance under different conditions, using real-world brain tumor radiomics data. We conducted two classification tasks of different difficulty levels with magnetic resonance imaging (MRI) radiomics features: (1) “Simple” task, glioblastomas [n = 109] vs. brain metastasis [n = 58] and (2) “difficult” task, low- [n = 163] vs. high-grade [n = 95] meningiomas. Additionally, two undersampled datasets were created by randomly sampling 50% from these datasets. We performed random training-test set splitting for each dataset repeatedly to create 1,000 different training-test set pairs. For each dataset pair, the least absolute shrinkage and selection operator model was trained and evaluated using various validation methods in the training set, and tested in the test set, using the area under the curve (AUC) as an evaluation metric. The AUCs in training and testing varied among different training-test set pairs, especially with the undersampled datasets and the difficult task. The mean (±standard deviation) AUC difference between training and testing was 0.039 (±0.032) for the simple task without undersampling and 0.092 (±0.071) for the difficult task with undersampling. In a training-test set pair with the difficult task without undersampling, for example, the AUC was high in training but much lower in testing (0.882 and 0.667, respectively); in another dataset pair with the same task, however, the AUC was low in training but much higher in testing (0.709 and 0.911, respectively). When the AUC discrepancy between training and test, or generalization gap, was large, none of the validation methods helped sufficiently reduce the generalization gap. 
Our results suggest that machine learning after a single random training-test set split may lead to unreliable results in radiomics studies especially with small sample sizes.
format	Online Article Text
id	pubmed-8360533
institution	National Center for Biotechnology Information
language	English
publishDate	2021
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-8360533 2021-08-13 PLoS One Research Article Public Library of Science 2021-08-12 /pmc/articles/PMC8360533/ /pubmed/34383858 http://dx.doi.org/10.1371/journal.pone.0256152 Text en © 2021 An et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title	Radiomics machine learning study with a small sample size: Single random training-test set split may lead to unreliable results
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8360533/ https://www.ncbi.nlm.nih.gov/pubmed/34383858 http://dx.doi.org/10.1371/journal.pone.0256152
