
Estimating classification accuracy in positive-unlabeled learning: characterization and correction strategies

Accurately estimating performance accuracy of machine learning classifiers is of fundamental importance in biomedical research with potentially societal consequences upon the deployment of best-performing tools in everyday life. Although classification has been extensively studied over the past deca...


Bibliographic Details
Main authors: Ramola, Rashika; Jain, Shantanu; Radivojac, Predrag
Format: Online Article Text
Language: English
Published: 2019
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6417800/
https://www.ncbi.nlm.nih.gov/pubmed/30864316
_version_ 1783403625372975104
author Ramola, Rashika
Jain, Shantanu
Radivojac, Predrag
collection PubMed
description Accurately estimating performance accuracy of machine learning classifiers is of fundamental importance in biomedical research with potentially societal consequences upon the deployment of best-performing tools in everyday life. Although classification has been extensively studied over the past decades, there remain understudied problems when the training data violate the main statistical assumptions relied upon for accurate learning and model characterization. This particularly holds true in the open world setting where observations of a phenomenon generally guarantee its presence but the absence of such evidence cannot be interpreted as the evidence of its absence. Learning from such data is often referred to as positive-unlabeled learning, a form of semi-supervised learning where all labeled data belong to one (say, positive) class. To improve the best practices in the field, we here study the quality of estimated performance in positive-unlabeled learning in the biomedical domain. We provide evidence that such estimates can be wildly inaccurate, depending on the fraction of positive examples in the unlabeled data and the fraction of negative examples mislabeled as positives in the labeled data. We then present correction methods for four such measures and demonstrate that the knowledge or accurate estimates of class priors in the unlabeled data and noise in the labeled data are sufficient for the recovery of true classification performance. We provide theoretical support as well as empirical evidence for the efficacy of the new performance estimation methods.
format Online
Article
Text
id pubmed-6417800
institution National Center for Biotechnology Information
language English
publishDate 2019
record_format MEDLINE/PubMed
spelling pubmed-6417800 2019-03-14 Estimating classification accuracy in positive-unlabeled learning: characterization and correction strategies Ramola, Rashika Jain, Shantanu Radivojac, Predrag Pac Symp Biocomput Article
2019 /pmc/articles/PMC6417800/ /pubmed/30864316 Text en http://creativecommons.org/licenses/by/4.0/ Open Access chapter published by World Scientific Publishing Company and distributed under the terms of the Creative Commons Attribution Non-Commercial (CC BY-NC) 4.0 License.
title Estimating classification accuracy in positive-unlabeled learning: characterization and correction strategies
topic Article
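The description states that knowing the class prior in the unlabeled data and the noise in the labeled data is sufficient to recover true classification performance. The Python sketch below illustrates why that can hold for two rate-based measures. It is not taken from the record and may differ from the authors' exact estimators. It assumes a simple mixture model: the labeled set contains a fraction beta of true positives (so 1 - beta is label noise), the unlabeled set contains a fraction alpha of positives, and a rate measured on either set is the corresponding mixture of the true positive rate and the true false positive rate. The function names and the symbols alpha and beta are illustrative choices.

```python
def observed_rates(tpr, fpr, alpha, beta):
    """Rates one would measure when treating labeled data as positive
    and unlabeled data as negative (assumed mixture model).

    alpha: fraction of true positives in the unlabeled data
    beta:  fraction of true positives in the labeled data (1 - beta is noise)
    """
    tpr_pu = beta * tpr + (1.0 - beta) * fpr    # measured on the labeled set
    fpr_pu = alpha * tpr + (1.0 - alpha) * fpr  # measured on the unlabeled set
    return tpr_pu, fpr_pu


def corrected_rates(tpr_pu, fpr_pu, alpha, beta):
    """Invert the two mixture equations to recover the true TPR and FPR."""
    if not 0.0 <= alpha < beta <= 1.0:
        # The system is only solvable when the labeled set is richer in
        # positives than the unlabeled set.
        raise ValueError("require 0 <= alpha < beta <= 1")
    denom = beta - alpha
    tpr = ((1.0 - alpha) * tpr_pu - (1.0 - beta) * fpr_pu) / denom
    fpr = (beta * fpr_pu - alpha * tpr_pu) / denom
    return tpr, fpr


if __name__ == "__main__":
    # Hypothetical values chosen only for illustration.
    true_tpr, true_fpr = 0.8, 0.1
    alpha, beta = 0.3, 0.9
    tpr_pu, fpr_pu = observed_rates(true_tpr, true_fpr, alpha, beta)
    print("measured on PU data:", tpr_pu, fpr_pu)
    print("after correction:   ", corrected_rates(tpr_pu, fpr_pu, alpha, beta))
```

Under these assumptions the uncorrected rates understate the true positive rate and overstate the false positive rate, and the error grows as alpha rises or beta falls, which is consistent with the description's remark that uncorrected estimates can be wildly inaccurate.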