Detecting biased validation of predictive models in the positive-unlabeled setting: disease gene prioritization case study
MOTIVATION: Positive-unlabeled data consists of points with either positive or unknown labels. It is widespread in medical, genetic, and biological settings, creating a high demand for predictive positive-unlabeled models. The performance of such models is usually estimated using validation sets, as...
Main Authors: | Molotkov, Ivan; Artomov, Mykyta |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2023 |
Subjects: | Original Article |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10517638/ https://www.ncbi.nlm.nih.gov/pubmed/37745001 http://dx.doi.org/10.1093/bioadv/vbad128 |
_version_ | 1785109366923329536 |
---|---|
author | Molotkov, Ivan Artomov, Mykyta |
author_facet | Molotkov, Ivan Artomov, Mykyta |
author_sort | Molotkov, Ivan |
collection | PubMed |
description | MOTIVATION: Positive-unlabeled data consists of points with either positive or unknown labels. It is widespread in medical, genetic, and biological settings, creating a high demand for predictive positive-unlabeled models. The performance of such models is usually estimated using validation sets, assumed to be selected completely at random (SCAR) from known positive examples. For certain metrics, this assumption enables unbiased performance estimation when treating positive-unlabeled data as positive/negative. However, the SCAR assumption is often adopted without proper justifications, simply for the sake of convenience. RESULTS: We provide an algorithm that under the weak assumptions of a lower bound on the number of positive examples can test for the violation of the SCAR assumption. Applying it to the problem of gene prioritization for complex genetic traits, we illustrate that the SCAR assumption is often violated there, causing the inflation of performance estimates, which we refer to as validation bias. We estimate the potential impact of validation bias on performance estimation. Our analysis reveals that validation bias is widespread in gene prioritization data and can significantly overestimate the performance of models. This finding elucidates the discrepancy between the reported good performance of models and their limited practical applications. AVAILABILITY AND IMPLEMENTATION: Python code with examples of application of the validation bias detection algorithm is available at github.com/ArtomovLab/ValidationBias. |
format | Online Article Text |
id | pubmed-10517638 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-10517638 2023-09-24 Detecting biased validation of predictive models in the positive-unlabeled setting: disease gene prioritization case study Molotkov, Ivan Artomov, Mykyta Bioinform Adv Original Article MOTIVATION: Positive-unlabeled data consists of points with either positive or unknown labels. It is widespread in medical, genetic, and biological settings, creating a high demand for predictive positive-unlabeled models. The performance of such models is usually estimated using validation sets, assumed to be selected completely at random (SCAR) from known positive examples. For certain metrics, this assumption enables unbiased performance estimation when treating positive-unlabeled data as positive/negative. However, the SCAR assumption is often adopted without proper justifications, simply for the sake of convenience. RESULTS: We provide an algorithm that under the weak assumptions of a lower bound on the number of positive examples can test for the violation of the SCAR assumption. Applying it to the problem of gene prioritization for complex genetic traits, we illustrate that the SCAR assumption is often violated there, causing the inflation of performance estimates, which we refer to as validation bias. We estimate the potential impact of validation bias on performance estimation. Our analysis reveals that validation bias is widespread in gene prioritization data and can significantly overestimate the performance of models. This finding elucidates the discrepancy between the reported good performance of models and their limited practical applications. AVAILABILITY AND IMPLEMENTATION: Python code with examples of application of the validation bias detection algorithm is available at github.com/ArtomovLab/ValidationBias. Oxford University Press 2023-09-14 /pmc/articles/PMC10517638/ /pubmed/37745001 http://dx.doi.org/10.1093/bioadv/vbad128 Text en © The Author(s) 2023. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Original Article Molotkov, Ivan Artomov, Mykyta Detecting biased validation of predictive models in the positive-unlabeled setting: disease gene prioritization case study |
title | Detecting biased validation of predictive models in the positive-unlabeled setting: disease gene prioritization case study |
title_full | Detecting biased validation of predictive models in the positive-unlabeled setting: disease gene prioritization case study |
title_fullStr | Detecting biased validation of predictive models in the positive-unlabeled setting: disease gene prioritization case study |
title_full_unstemmed | Detecting biased validation of predictive models in the positive-unlabeled setting: disease gene prioritization case study |
title_short | Detecting biased validation of predictive models in the positive-unlabeled setting: disease gene prioritization case study |
title_sort | detecting biased validation of predictive models in the positive-unlabeled setting: disease gene prioritization case study |
topic | Original Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10517638/ https://www.ncbi.nlm.nih.gov/pubmed/37745001 http://dx.doi.org/10.1093/bioadv/vbad128 |
work_keys_str_mv | AT molotkovivan detectingbiasedvalidationofpredictivemodelsinthepositiveunlabeledsettingdiseasegeneprioritizationcasestudy AT artomovmykyta detectingbiasedvalidationofpredictivemodelsinthepositiveunlabeledsettingdiseasegeneprioritizationcasestudy |