
A new concordant partial AUC and partial c statistic for imbalanced data in the evaluation of machine learning algorithms

BACKGROUND: In classification and diagnostic testing, the receiver-operator characteristic (ROC) plot and the area under the ROC curve (AUC) describe how an adjustable threshold causes changes in two types of error: false positives and false negatives. Only part of the ROC curve and AUC are informat...


Bibliographic Details
Main Authors:	Carrington, André M., Fieguth, Paul W., Qazi, Hammad, Holzinger, Andreas, Chen, Helen H., Mayr, Franz, Manuel, Douglas G.
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2020
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6945414/ https://www.ncbi.nlm.nih.gov/pubmed/31906931 http://dx.doi.org/10.1186/s12911-019-1014-6

collection	PubMed
description	BACKGROUND: In classification and diagnostic testing, the receiver-operator characteristic (ROC) plot and the area under the ROC curve (AUC) describe how an adjustable threshold causes changes in two types of error: false positives and false negatives. Only part of the ROC curve and AUC is informative, however, when they are used with imbalanced data. Hence, alternatives to the AUC have been proposed, such as the partial AUC and the area under the precision-recall curve. However, these alternatives cannot be interpreted as fully as the AUC, in part because they ignore some information about actual negatives. METHODS: We derive and propose a new concordant partial AUC and a new partial c statistic for ROC data, as foundational measures and methods to help understand and explain parts of the ROC plot and AUC. Our partial measures are continuous and discrete versions of the same measure, are derived from the AUC and c statistic respectively, are validated as equal to each other, and are validated as equal in summation to whole measures where expected. Our partial measures are tested for validity on a classic ROC example from Fawcett, a variation thereof, and two real-life benchmark data sets in breast cancer: the Wisconsin and Ljubljana data sets. Interpretation of an example is then provided. RESULTS: Results show the expected equalities between our new partial measures and the existing whole measures. The example interpretation illustrates the need for our newly derived partial measures. CONCLUSIONS: The concordant partial area under the ROC curve was proposed and, unlike previous partial measure alternatives, it maintains the characteristics of the AUC. The first partial c statistic for ROC plots was also proposed as an unbiased interpretation for part of an ROC curve. The expected equalities among and between our newly derived partial measures and their existing full measure counterparts are confirmed. These measures may be used with any data set, but this paper focuses on imbalanced data with low prevalence. FUTURE WORK: Future work with our proposed measures may demonstrate their value for imbalanced data with high prevalence, compare them to other measures not based on areas, and combine them with other ROC measures and techniques.
format	Online Article Text
id	pubmed-6945414
institution	National Center for Biotechnology Information
language	English
publishDate	2020
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-6945414 2020-01-09 BMC Med Inform Decis Mak Research Article BioMed Central 2020-01-06 /pmc/articles/PMC6945414/ /pubmed/31906931 http://dx.doi.org/10.1186/s12911-019-1014-6 Text en © The Author(s). 2020 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	A new concordant partial AUC and partial c statistic for imbalanced data in the evaluation of machine learning algorithms
topic	Research Article

Similar Items