
Data sets for author name disambiguation: an empirical analysis and a new resource

Data sets of publication meta data with manually disambiguated author names play an important role in current author name disambiguation (AND) research. We review the most important data sets used so far, and compare their respective advantages and shortcomings. From the results of this review, we d...


Bibliographic Details
Main Authors: Müller, Mark-Christoph, Reitz, Florian, Roy, Nicolas
Format: Online Article Text
Language: English
Published: Springer Netherlands 2017
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5438420/
https://www.ncbi.nlm.nih.gov/pubmed/28596627
http://dx.doi.org/10.1007/s11192-017-2363-5
author Müller, Mark-Christoph
Reitz, Florian
Roy, Nicolas
collection PubMed
description Data sets of publication meta data with manually disambiguated author names play an important role in current author name disambiguation (AND) research. We review the most important data sets used so far, and compare their respective advantages and shortcomings. From the results of this review, we derive a set of general requirements to future AND data sets. These include both trivial requirements, like absence of errors and preservation of author order, and more substantial ones, like full disambiguation and adequate representation of publications with a small number of authors and highly variable author names. On the basis of these requirements, we create and make publicly available a new AND data set, SCAD-zbMATH. Both the quantitative analysis of this data set and the results of our initial AND experiments with a naive baseline algorithm show the SCAD-zbMATH data set to be considerably different from existing ones. We consider it a useful new resource that will challenge the state of the art in AND and benefit the AND research community.
format Online
Article
Text
id pubmed-5438420
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher Springer Netherlands
record_format MEDLINE/PubMed
spelling pubmed-5438420 2017-06-06 Data sets for author name disambiguation: an empirical analysis and a new resource Müller, Mark-Christoph Reitz, Florian Roy, Nicolas Scientometrics Article Data sets of publication meta data with manually disambiguated author names play an important role in current author name disambiguation (AND) research. We review the most important data sets used so far, and compare their respective advantages and shortcomings. From the results of this review, we derive a set of general requirements to future AND data sets. These include both trivial requirements, like absence of errors and preservation of author order, and more substantial ones, like full disambiguation and adequate representation of publications with a small number of authors and highly variable author names. On the basis of these requirements, we create and make publicly available a new AND data set, SCAD-zbMATH. Both the quantitative analysis of this data set and the results of our initial AND experiments with a naive baseline algorithm show the SCAD-zbMATH data set to be considerably different from existing ones. We consider it a useful new resource that will challenge the state of the art in AND and benefit the AND research community. Springer Netherlands 2017-03-27 2017 /pmc/articles/PMC5438420/ /pubmed/28596627 http://dx.doi.org/10.1007/s11192-017-2363-5 Text en © The Author(s) 2017 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
title Data sets for author name disambiguation: an empirical analysis and a new resource
topic Article