
Active label cleaning for improved dataset quality under resource constraints

Bibliographic Details
Main Authors: Bernhardt, Mélanie, Castro, Daniel C., Tanno, Ryutaro, Schwaighofer, Anton, Tezcan, Kerem C., Monteiro, Miguel, Bannur, Shruthi, Lungren, Matthew P., Nori, Aditya, Glocker, Ben, Alvarez-Valle, Javier, Oktay, Ozan
Format: Online Article Text
Language: English
Published: Nature Publishing Group UK 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8897392/
https://www.ncbi.nlm.nih.gov/pubmed/35246539
http://dx.doi.org/10.1038/s41467-022-28818-3
Description
Summary: Imperfections in data annotation, known as label noise, are detrimental to the training of machine learning models and have a confounding effect on the assessment of model performance. Nevertheless, employing experts to remove label noise by fully re-annotating large datasets is infeasible in resource-constrained settings, such as healthcare. This work advocates for a data-driven approach to prioritising samples for re-annotation, which we term “active label cleaning”. We propose to rank instances according to the estimated label correctness and labelling difficulty of each sample, and introduce a simulation framework to evaluate relabelling efficacy. Our experiments on natural images and on a specifically devised medical imaging benchmark show that cleaning noisy labels mitigates their negative impact on model training, evaluation, and selection. Crucially, the proposed approach enables correcting labels up to 4× more effectively than typical random selection in realistic conditions, making better use of experts’ valuable time for improving dataset quality.
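
To make the ranking idea concrete, the sketch below shows one plausible scoring rule, an illustration under assumptions rather than the authors' exact formulation: a sample is prioritised for relabelling when a trained model assigns low probability to its current label (likely mislabelled) while its predicted posterior has low entropy (easy to adjudicate). The helper ranking_scores is hypothetical.

    import numpy as np

    def ranking_scores(pred_probs, given_labels, eps=1e-12):
        """Hypothetical active-label-cleaning score combining two signals:
        - label correctness: cross-entropy of the current label under the
          model's predicted posterior (high => label likely wrong), and
        - labelling difficulty: entropy of the posterior (high => ambiguous,
          costly to relabel).
        Samples that look mislabelled but easy are ranked first."""
        p = np.clip(pred_probs, eps, 1.0)
        # Cross-entropy of the assigned label under the predicted posterior.
        label_ce = -np.log(p[np.arange(len(given_labels)), given_labels])
        # Posterior entropy as a proxy for labelling difficulty.
        entropy = -np.sum(p * np.log(p), axis=1)
        return label_ce - entropy  # higher score => relabel sooner

    # Example: rank three samples given 3-class posteriors, all labelled class 0.
    probs = np.array([[0.90, 0.05, 0.05],   # confident, agrees with label
                      [0.10, 0.80, 0.10],   # confident, disagrees with label
                      [0.34, 0.33, 0.33]])  # ambiguous
    labels = np.array([0, 0, 0])
    order = np.argsort(-ranking_scores(probs, labels))
    print(order)  # [1 2 0]: the likely-mislabelled, easy sample comes first

Under this scoring, the confidently contradicted sample is sent to experts before the ambiguous one, which matches the stated goal of spending annotation budget where a single relabel is most likely to fix an error.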