
Label Noise Cleaning with an Adaptive Ensemble Method Based on Noise Detection Metric


Bibliographic Details
Main Authors: Feng, Wei; Quan, Yinghui; Dauphin, Gabriel
Format: Online Article Text
Language: English
Published: MDPI 2020
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7727820/
https://www.ncbi.nlm.nih.gov/pubmed/33255363
http://dx.doi.org/10.3390/s20236718
author Feng, Wei
Quan, Yinghui
Dauphin, Gabriel
collection PubMed
description Real-world datasets are often contaminated with label noise: labeling is not a clear-cut process, and reliable labeling methods tend to be expensive or time-consuming. Depending on the learning technique used, such label noise can be harmful: it may require a larger training set, make the trained model more complex and more prone to overfitting, and yield less accurate predictions. This work proposes a cleaning technique called the ensemble method based on the noise detection metric (ENDM). From the corrupted training set, an ensemble classifier is first learned and used to derive four metrics assessing the likelihood that a sample is mislabeled. For each metric, three thresholds are set to maximize the classification performance on a corrupted validation dataset when using three different ensemble classifiers, namely Bagging, AdaBoost and k-nearest neighbor (k-NN). These thresholds are used to identify and then either remove or correct the corrupted samples. The effectiveness of the ENDM is demonstrated on the classification of 15 public datasets. The ENDM is compared with the homogeneous-ensemble-based majority vote and consensus vote methods, two popular ensemble-based label noise filters.
format Online
Article
Text
id pubmed-7727820
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher MDPI
record_format MEDLINE/PubMed
spelling pubmed-7727820 2020-12-11 Label Noise Cleaning with an Adaptive Ensemble Method Based on Noise Detection Metric Feng, Wei Quan, Yinghui Dauphin, Gabriel Sensors (Basel) Article MDPI 2020-11-24 /pmc/articles/PMC7727820/ /pubmed/33255363 http://dx.doi.org/10.3390/s20236718 Text en © 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
title Label Noise Cleaning with an Adaptive Ensemble Method Based on Noise Detection Metric
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7727820/
https://www.ncbi.nlm.nih.gov/pubmed/33255363
http://dx.doi.org/10.3390/s20236718