
Don’t lose samples to estimation

In a typical predictive modeling task, we are asked to produce a final predictive model to employ operationally for predictions, as well as an estimate of its out-of-sample predictive performance. Typically, analysts hold out a portion of the available data, called a Test set, to estimate the model...


Bibliographic Details
Main author: Tsamardinos, Ioannis
Format: Online Article Text
Language: English
Published: Elsevier 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9782254/
https://www.ncbi.nlm.nih.gov/pubmed/36569551
http://dx.doi.org/10.1016/j.patter.2022.100612
collection PubMed
description In a typical predictive modeling task, we are asked to produce a final predictive model to employ operationally for predictions, as well as an estimate of its out-of-sample predictive performance. Typically, analysts hold out a portion of the available data, called a Test set, to estimate the model predictive performance on unseen (out-of-sample) records, thus “losing these samples to estimation.” However, this practice is unacceptable when the total sample size is low. To avoid losing data to estimation, we need a shift in our perspective: we do not estimate the performance of a specific model instance; we estimate the performance of the pipeline that produces the model. This pipeline is applied on all available samples to produce the final model; no samples are lost to estimation. An estimate of its performance is provided by training the same pipeline on subsets of the samples. When multiple pipelines are tried, additional considerations that correct for the “winner’s curse” need to be in place.
id pubmed-9782254
institution National Center for Biotechnology Information
record_format MEDLINE/PubMed
spelling pubmed-9782254 2022-12-24 Don’t lose samples to estimation Tsamardinos, Ioannis Patterns (N Y) Perspective Elsevier 2022-12-09 /pmc/articles/PMC9782254/ /pubmed/36569551 http://dx.doi.org/10.1016/j.patter.2022.100612 Text en © 2022 The Author https://creativecommons.org/licenses/by-nc-nd/4.0/ This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
title Don’t lose samples to estimation
topic Perspective
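The description in this record outlines a protocol: estimate the performance of the pipeline by training it on subsets of the samples, then apply the same pipeline to all samples to produce the final model. A minimal sketch of that idea follows, assuming k-fold cross-validation as the subsetting scheme (the record does not name one) and a toy least-squares line as the pipeline. The names `fit_pipeline`, `cross_validated_error`, and `train_and_estimate` are hypothetical and not taken from the article, and the correction for the "winner's curse" when several pipelines are compared is not covered here.

```python
from statistics import mean


def fit_pipeline(xs, ys):
    """Hypothetical pipeline: fit a least-squares line and return a predictor."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx if sxx else 0.0
    intercept = my - slope * mx
    return lambda x: slope * x + intercept


def cross_validated_error(xs, ys, k=5):
    """Mean absolute error of the pipeline, measured on held-out folds."""
    n = len(xs)
    errors = []
    for fold in range(k):
        # Assumption: every k-th sample, starting at `fold`, is held out.
        test_idx = set(range(fold, n, k))
        train_x = [xs[i] for i in range(n) if i not in test_idx]
        train_y = [ys[i] for i in range(n) if i not in test_idx]
        model = fit_pipeline(train_x, train_y)
        errors.extend(abs(model(xs[i]) - ys[i]) for i in sorted(test_idx))
    return mean(errors)


def train_and_estimate(xs, ys, k=5):
    """Return (final model trained on ALL samples, performance estimate)."""
    estimate = cross_validated_error(xs, ys, k)
    final_model = fit_pipeline(xs, ys)  # no samples are held back
    return final_model, estimate
```

In this sketch the reported error belongs to the pipeline rather than to any single fitted model: each fold's model is discarded after scoring, and the model that is returned has seen every sample.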