Unsupervised Bootstrapping of Active Learning for Entity Resolution

Entity resolution is one of the central challenges when integrating data from large numbers of data sources. Active learning for entity resolution aims to learn high-quality matching models while minimizing the human labeling effort by selecting only the most informative record pairs for labeling. M...

Bibliographic Details
Main Authors: Primpeli, Anna; Bizer, Christian; Keuper, Margret
Format: Online Article Text
Language: English
Published: 2020
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7250605/
http://dx.doi.org/10.1007/978-3-030-49461-2_13
author Primpeli, Anna
Bizer, Christian
Keuper, Margret
collection PubMed
description Entity resolution is one of the central challenges when integrating data from large numbers of data sources. Active learning for entity resolution aims to learn high-quality matching models while minimizing the human labeling effort by selecting only the most informative record pairs for labeling. Most active learning methods proposed so far start with an empty set of labeled record pairs and iteratively improve the prediction quality of a classification model by asking for new labels. The absence of adequate labeled data in the early active learning iterations leads to unstable models of low quality, which is known as the cold start problem. In our work, we solve the cold start problem using an unsupervised matching method to bootstrap active learning. We implement a thresholding heuristic that considers pre-calculated similarity scores and assigns matching labels with some degree of noise at no manual labeling cost. The noisy labels are used for initializing the active learning process and throughout the whole active learning cycle for model learning and query selection. We evaluate our pipeline with six datasets from three different entity resolution settings using active learning with a committee-based query strategy and show that it successfully deals with the cold start problem. Comparing our method against two active learning baselines without bootstrapping, we show that it can additionally lead to overall improved learned models in terms of F1 score and stability.
format Online
Article
Text
id pubmed-7250605
institution National Center for Biotechnology Information
language English
publishDate 2020
record_format MEDLINE/PubMed
spelling pubmed-7250605 2020-05-27 Unsupervised Bootstrapping of Active Learning for Entity Resolution Primpeli, Anna Bizer, Christian Keuper, Margret The Semantic Web Article
2020-05-07 /pmc/articles/PMC7250605/ http://dx.doi.org/10.1007/978-3-030-49461-2_13 Text en © Springer Nature Switzerland AG 2020 This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic.
title Unsupervised Bootstrapping of Active Learning for Entity Resolution
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7250605/
http://dx.doi.org/10.1007/978-3-030-49461-2_13
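
The abstract describes two steps that a small example may make more concrete: assigning noisy match labels by thresholding pre-calculated similarity scores, and picking the next pair to label with a committee-based query strategy. The following is only a minimal sketch under stated assumptions. The function names, the plain mean used to aggregate per-attribute similarities, the fixed threshold of 0.5, and the vote-based disagreement measure are all hypothetical choices made for illustration; the record does not specify how the authors compute the threshold or measure committee disagreement.

```python
from statistics import mean


def bootstrap_labels(similarity_vectors, threshold=0.5):
    """Assign noisy match labels from pre-calculated similarity scores.

    Each record pair is a list of per-attribute similarities in [0, 1].
    ASSUMPTION: aggregating with a plain mean and using a fixed threshold
    are illustrative choices, not the heuristic from the article.
    Returns a list of (is_match, distance_from_threshold) tuples.
    """
    labels = []
    for vector in similarity_vectors:
        score = mean(vector)
        # Pairs far from the threshold are less likely to be mislabeled.
        labels.append((score >= threshold, abs(score - threshold)))
    return labels


def committee_disagreement(votes):
    """Fraction of committee members that disagree with the majority vote."""
    positives = sum(votes)
    majority = max(positives, len(votes) - positives)
    return 1 - majority / len(votes)


def select_query(candidate_votes):
    """Return the index of the candidate pair the committee disagrees on most."""
    return max(
        range(len(candidate_votes)),
        key=lambda i: committee_disagreement(candidate_votes[i]),
    )


# Three record pairs, each with two per-attribute similarity scores.
pairs = [[0.9, 0.8], [0.1, 0.2], [0.6, 0.5]]
noisy = bootstrap_labels(pairs)

# Match/non-match votes from a three-member committee for the same pairs.
votes = [[True, True, True], [True, False, True], [False, False, False]]
next_pair = select_query(votes)
```

In this sketch the noisy labels would seed the first training set, and `select_query` would choose which pair to send to a human annotator in each iteration. A real pipeline would train the committee members on the combined noisy and manually labeled pairs before each vote.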