Estimating classification accuracy in positive-unlabeled learning: characterization and correction strategies
Accurately estimating performance accuracy of machine learning classifiers is of fundamental importance in biomedical research with potentially societal consequences upon the deployment of best-performing tools in everyday life. Although classification has been extensively studied over the past deca...
Main Authors: | Ramola, Rashika; Jain, Shantanu; Radivojac, Predrag |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | 2019 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6417800/ https://www.ncbi.nlm.nih.gov/pubmed/30864316 |
_version_ | 1783403625372975104 |
---|---|
author | Ramola, Rashika Jain, Shantanu Radivojac, Predrag |
author_sort | Ramola, Rashika |
collection | PubMed |
description | Accurately estimating performance accuracy of machine learning classifiers is of fundamental importance in biomedical research with potentially societal consequences upon the deployment of best-performing tools in everyday life. Although classification has been extensively studied over the past decades, there remain understudied problems when the training data violate the main statistical assumptions relied upon for accurate learning and model characterization. This particularly holds true in the open world setting where observations of a phenomenon generally guarantee its presence but the absence of such evidence cannot be interpreted as the evidence of its absence. Learning from such data is often referred to as positive-unlabeled learning, a form of semi-supervised learning where all labeled data belong to one (say, positive) class. To improve the best practices in the field, we here study the quality of estimated performance in positive-unlabeled learning in the biomedical domain. We provide evidence that such estimates can be wildly inaccurate, depending on the fraction of positive examples in the unlabeled data and the fraction of negative examples mislabeled as positives in the labeled data. We then present correction methods for four such measures and demonstrate that the knowledge or accurate estimates of class priors in the unlabeled data and noise in the labeled data are sufficient for the recovery of true classification performance. We provide theoretical support as well as empirical evidence for the efficacy of the new performance estimation methods. |
format | Online Article Text |
id | pubmed-6417800 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2019 |
record_format | MEDLINE/PubMed |
spelling | pubmed-6417800 2019-03-14 Estimating classification accuracy in positive-unlabeled learning: characterization and correction strategies Ramola, Rashika Jain, Shantanu Radivojac, Predrag Pac Symp Biocomput Article 2019 /pmc/articles/PMC6417800/ /pubmed/30864316 Text en http://creativecommons.org/licenses/by/4.0/ Open Access chapter published by World Scientific Publishing Company and distributed under the terms of the Creative Commons Attribution Non-Commercial (CC BY-NC) 4.0 License. |
spellingShingle | Article Ramola, Rashika Jain, Shantanu Radivojac, Predrag Estimating classification accuracy in positive-unlabeled learning: characterization and correction strategies |
title | Estimating classification accuracy in positive-unlabeled learning: characterization and correction strategies |
title_sort | estimating classification accuracy in positive-unlabeled learning: characterization and correction strategies |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6417800/ https://www.ncbi.nlm.nih.gov/pubmed/30864316 |