Ethnicity Sensitive Author Disambiguation Using Semi-supervised Learning

Author name disambiguation in bibliographic databases is the problem of grouping together scientific publications written by the same person, accounting for potential homonyms and/or synonyms. Among solutions to this problem, digital libraries are increasingly offering tools for authors to manually...

Bibliographic Details
Main authors: Louppe, Gilles; Al-Natsheh, Hussein T.; Susik, Mateusz; Maguire, Eamonn James
Language: eng
Published: 2015
Subjects:
Online access: https://dx.doi.org/10.1007/978-3-319-45880-9_21
http://cds.cern.ch/record/2222878
_version_ 1780952382809571328
author Louppe, Gilles
Al-Natsheh, Hussein T.
Susik, Mateusz
Maguire, Eamonn James
author_sort Louppe, Gilles
collection CERN
description Author name disambiguation in bibliographic databases is the problem of grouping together scientific publications written by the same person, accounting for potential homonyms and/or synonyms. Among solutions to this problem, digital libraries are increasingly offering tools for authors to manually curate their publications and claim those that are theirs. Indirectly, these tools allow for the inexpensive collection of large annotated training data, which can be further leveraged to build a complementary automated disambiguation system capable of inferring patterns for identifying publications written by the same person. Building on more than 1 million publicly released crowdsourced annotations, we propose an automated author disambiguation solution exploiting this data (i) to learn an accurate classifier for identifying coreferring authors and (ii) to guide the clustering of scientific publications by distinct authors in a semi-supervised way. To the best of our knowledge, our analysis is the first to be carried out on data of this size and coverage. With respect to the state of the art, we validate the general pipeline used in most existing solutions, and improve by: (i) proposing phonetic-based blocking strategies, thereby increasing recall; and (ii) adding strong ethnicity-sensitive features for learning a linkage function, thereby tailoring disambiguation to non-Western author names whenever necessary.
id oai-inspirehep.net-1487780
institution European Organization for Nuclear Research
language eng
publishDate 2015
record_format invenio
spelling oai-inspirehep.net-1487780
2023-03-14T19:33:04Z
doi:10.1007/978-3-319-45880-9_21
http://cds.cern.ch/record/2222878
eng
Louppe, Gilles
Al-Natsheh, Hussein T.
Susik, Mateusz
Maguire, Eamonn James
Ethnicity Sensitive Author Disambiguation Using Semi-supervised Learning
Computing and Computers
cs.DL
cs.IR
stat.ML
arXiv:1508.07744
oai:inspirehep.net:1487780
2015-08-31
title Ethnicity Sensitive Author Disambiguation Using Semi-supervised Learning
topic Computing and Computers
cs.DL
cs.IR
stat.ML
url https://dx.doi.org/10.1007/978-3-319-45880-9_21
http://cds.cern.ch/record/2222878