Ethnicity Sensitive Author Disambiguation Using Semi-supervised Learning
Author name disambiguation in bibliographic databases is the problem of grouping together scientific publications written by the same person, accounting for potential homonyms and/or synonyms. Among solutions to this problem, digital libraries are increasingly offering tools for authors to manually...
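Main authors: | Louppe, Gilles; Al-Natsheh, Hussein T.; Susik, Mateusz; Maguire, Eamonn James |
---|---|
Language: | eng |
Published: | 2015 |
Subjects: | Computing and Computers; cs.DL; cs.IR; stat.ML |
Online access: | https://dx.doi.org/10.1007/978-3-319-45880-9_21 http://cds.cern.ch/record/2222878 |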
_version_ | 1780952382809571328 |
---|---|
author | Louppe, Gilles; Al-Natsheh, Hussein T.; Susik, Mateusz; Maguire, Eamonn James |
author_facet | Louppe, Gilles; Al-Natsheh, Hussein T.; Susik, Mateusz; Maguire, Eamonn James |
author_sort | Louppe, Gilles |
collection | CERN |
description | Author name disambiguation in bibliographic databases is the problem of
grouping together scientific publications written by the same person,
accounting for potential homonyms and/or synonyms. Among solutions to this
problem, digital libraries are increasingly offering tools for authors to
manually curate their publications and claim those that are theirs. Indirectly,
these tools allow for the inexpensive collection of large annotated training
data, which can be further leveraged to build a complementary automated
disambiguation system capable of inferring patterns for identifying
publications written by the same person. Building on more than 1 million
publicly released crowdsourced annotations, we propose an automated author
disambiguation solution exploiting this data (i) to learn an accurate
classifier for identifying coreferring authors and (ii) to guide the clustering
of scientific publications by distinct authors in a semi-supervised way. To the
best of our knowledge, our analysis is the first to be carried out on data of
this size and coverage. With respect to the state of the art, we validate the
general pipeline used in most existing solutions, and improve by: (i) proposing
phonetic-based blocking strategies, thereby increasing recall; and (ii) adding
strong ethnicity-sensitive features for learning a linkage function, thereby
tailoring disambiguation to non-Western author names whenever necessary. |
id | oai-inspirehep.net-1487780 |
institution | Organización Europea para la Investigación Nuclear |
language | eng |
publishDate | 2015 |
record_format | invenio |
spelling | oai-inspirehep.net-1487780 2023-03-14T19:33:04Z doi:10.1007/978-3-319-45880-9_21 http://cds.cern.ch/record/2222878 eng Louppe, Gilles Al-Natsheh, Hussein T. Susik, Mateusz Maguire, Eamonn James Ethnicity Sensitive Author Disambiguation Using Semi-supervised Learning Computing and Computers cs.DL cs.IR stat.ML Author name disambiguation in bibliographic databases is the problem of grouping together scientific publications written by the same person, accounting for potential homonyms and/or synonyms. Among solutions to this problem, digital libraries are increasingly offering tools for authors to manually curate their publications and claim those that are theirs. Indirectly, these tools allow for the inexpensive collection of large annotated training data, which can be further leveraged to build a complementary automated disambiguation system capable of inferring patterns for identifying publications written by the same person. Building on more than 1 million publicly released crowdsourced annotations, we propose an automated author disambiguation solution exploiting this data (i) to learn an accurate classifier for identifying coreferring authors and (ii) to guide the clustering of scientific publications by distinct authors in a semi-supervised way. To the best of our knowledge, our analysis is the first to be carried out on data of this size and coverage. With respect to the state of the art, we validate the general pipeline used in most existing solutions, and improve by: (i) proposing phonetic-based blocking strategies, thereby increasing recall; and (ii) adding strong ethnicity-sensitive features for learning a linkage function, thereby tailoring disambiguation to non-Western author names whenever necessary. arXiv:1508.07744 oai:inspirehep.net:1487780 2015-08-31 |
spellingShingle | Computing and Computers cs.DL cs.IR stat.ML Louppe, Gilles Al-Natsheh, Hussein T. Susik, Mateusz Maguire, Eamonn James Ethnicity Sensitive Author Disambiguation Using Semi-supervised Learning |
title | Ethnicity Sensitive Author Disambiguation Using Semi-supervised Learning |
title_full | Ethnicity Sensitive Author Disambiguation Using Semi-supervised Learning |
title_fullStr | Ethnicity Sensitive Author Disambiguation Using Semi-supervised Learning |
title_full_unstemmed | Ethnicity Sensitive Author Disambiguation Using Semi-supervised Learning |
title_short | Ethnicity Sensitive Author Disambiguation Using Semi-supervised Learning |
title_sort | ethnicity sensitive author disambiguation using semi-supervised learning |
topic | Computing and Computers cs.DL cs.IR stat.ML |
url | https://dx.doi.org/10.1007/978-3-319-45880-9_21 http://cds.cern.ch/record/2222878 |
work_keys_str_mv | AT louppegilles ethnicitysensitiveauthordisambiguationusingsemisupervisedlearning AT alnatshehhusseint ethnicitysensitiveauthordisambiguationusingsemisupervisedlearning AT susikmateusz ethnicitysensitiveauthordisambiguationusingsemisupervisedlearning AT maguireeamonnjames ethnicitysensitiveauthordisambiguationusingsemisupervisedlearning |
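
The description field above mentions phonetic-based blocking as a way to increase recall. As a rough illustration only, the sketch below groups author signatures by a phonetic key of the surname, so that spelling variants land in the same candidate block before any pairwise linkage is attempted. The record does not say which phonetic algorithm the authors use; American Soundex is an assumption made here for simplicity, and the names `soundex` and `block_signatures`, as well as the variant spellings in the sample list, are hypothetical.

```python
import re
import unicodedata
from collections import defaultdict

# Soundex consonant classes (assumed algorithm; not taken from the record).
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Return the 4-character American Soundex code of a name ('' if no letters)."""
    # Strip accents and keep only ASCII letters, e.g. "Müller" -> "muller".
    ascii_name = unicodedata.normalize("NFKD", name).encode("ascii", "ignore").decode("ascii")
    letters = re.sub(r"[^a-z]", "", ascii_name.lower())
    if not letters:
        return ""
    first = letters[0]
    digits = []
    prev = _CODES.get(first, "")  # the first letter's code also suppresses repeats
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            digits.append(code)
        prev = code  # vowels reset prev, so a repeated code after a vowel is kept
    return (first.upper() + "".join(digits) + "000")[:4]


def block_signatures(signatures):
    """Group 'Surname, Given names' strings by the Soundex code of the surname."""
    blocks = defaultdict(list)
    for sig in signatures:
        surname = sig.split(",", 1)[0]
        blocks[soundex(surname)].append(sig)
    return dict(blocks)


# Hypothetical signatures: two names from this record plus invented variants.
signatures = [
    "Maguire, Eamonn James",
    "McGuire, E.",
    "Louppe, Gilles",
    "Loupe, G.",
    "Susik, Mateusz",
]
for key, members in sorted(block_signatures(signatures).items()):
    print(key, members)
```

In this sketch "Maguire" and "McGuire" share the key M260, so a later linkage step would compare them even though their raw spellings differ. An exact-match blocking on the surname would have separated them, which is the recall gain the description refers to.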