Active label cleaning for improved dataset quality under resource constraints
Imperfections in data annotation, known as label noise, are detrimental to the training of machine learning models and have a confounding effect on the assessment of model performance. Nevertheless, employing experts to remove label noise by fully re-annotating large datasets is infeasible in resour...
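Main authors: | Bernhardt, Mélanie; Castro, Daniel C.; Tanno, Ryutaro; Schwaighofer, Anton; Tezcan, Kerem C.; Monteiro, Miguel; Bannur, Shruthi; Lungren, Matthew P.; Nori, Aditya; Glocker, Ben; Alvarez-Valle, Javier; Oktay, Ozan |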
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Nature Publishing Group UK 2022 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8897392/ https://www.ncbi.nlm.nih.gov/pubmed/35246539 http://dx.doi.org/10.1038/s41467-022-28818-3 |
_version_ | 1784663392105005056 |
---|---|
author | Bernhardt, Mélanie Castro, Daniel C. Tanno, Ryutaro Schwaighofer, Anton Tezcan, Kerem C. Monteiro, Miguel Bannur, Shruthi Lungren, Matthew P. Nori, Aditya Glocker, Ben Alvarez-Valle, Javier Oktay, Ozan |
author_facet | Bernhardt, Mélanie Castro, Daniel C. Tanno, Ryutaro Schwaighofer, Anton Tezcan, Kerem C. Monteiro, Miguel Bannur, Shruthi Lungren, Matthew P. Nori, Aditya Glocker, Ben Alvarez-Valle, Javier Oktay, Ozan |
author_sort | Bernhardt, Mélanie |
collection | PubMed |
description | Imperfections in data annotation, known as label noise, are detrimental to the training of machine learning models and have a confounding effect on the assessment of model performance. Nevertheless, employing experts to remove label noise by fully re-annotating large datasets is infeasible in resource-constrained settings, such as healthcare. This work advocates for a data-driven approach to prioritising samples for re-annotation—which we term “active label cleaning”. We propose to rank instances according to estimated label correctness and labelling difficulty of each sample, and introduce a simulation framework to evaluate relabelling efficacy. Our experiments on natural images and on a specifically-devised medical imaging benchmark show that cleaning noisy labels mitigates their negative impact on model training, evaluation, and selection. Crucially, the proposed approach enables correcting labels up to 4× more effectively than typical random selection in realistic conditions, making better use of experts’ valuable time for improving dataset quality. |
format | Online Article Text |
id | pubmed-8897392 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2022 |
publisher | Nature Publishing Group UK |
record_format | MEDLINE/PubMed |
spelling | pubmed-8897392 2022-03-17 Active label cleaning for improved dataset quality under resource constraints Bernhardt, Mélanie Castro, Daniel C. Tanno, Ryutaro Schwaighofer, Anton Tezcan, Kerem C. Monteiro, Miguel Bannur, Shruthi Lungren, Matthew P. Nori, Aditya Glocker, Ben Alvarez-Valle, Javier Oktay, Ozan Nat Commun Article Imperfections in data annotation, known as label noise, are detrimental to the training of machine learning models and have a confounding effect on the assessment of model performance. Nevertheless, employing experts to remove label noise by fully re-annotating large datasets is infeasible in resource-constrained settings, such as healthcare. This work advocates for a data-driven approach to prioritising samples for re-annotation—which we term “active label cleaning”. We propose to rank instances according to estimated label correctness and labelling difficulty of each sample, and introduce a simulation framework to evaluate relabelling efficacy. Our experiments on natural images and on a specifically-devised medical imaging benchmark show that cleaning noisy labels mitigates their negative impact on model training, evaluation, and selection. Crucially, the proposed approach enables correcting labels up to 4× more effectively than typical random selection in realistic conditions, making better use of experts’ valuable time for improving dataset quality. Nature Publishing Group UK 2022-03-04 /pmc/articles/PMC8897392/ /pubmed/35246539 http://dx.doi.org/10.1038/s41467-022-28818-3 Text en © The Author(s) 2022 https://creativecommons.org/licenses/by/4.0/ Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/. |
spellingShingle | Article Bernhardt, Mélanie Castro, Daniel C. Tanno, Ryutaro Schwaighofer, Anton Tezcan, Kerem C. Monteiro, Miguel Bannur, Shruthi Lungren, Matthew P. Nori, Aditya Glocker, Ben Alvarez-Valle, Javier Oktay, Ozan Active label cleaning for improved dataset quality under resource constraints |
title | Active label cleaning for improved dataset quality under resource constraints |
title_full | Active label cleaning for improved dataset quality under resource constraints |
title_fullStr | Active label cleaning for improved dataset quality under resource constraints |
title_full_unstemmed | Active label cleaning for improved dataset quality under resource constraints |
title_short | Active label cleaning for improved dataset quality under resource constraints |
title_sort | active label cleaning for improved dataset quality under resource constraints |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8897392/ https://www.ncbi.nlm.nih.gov/pubmed/35246539 http://dx.doi.org/10.1038/s41467-022-28818-3 |