Author Clustering on Large Bibliographies

We analyze and design an algorithm for clustering large sets of authors in bibliographies. Rather than applying a distance function to compare records pairwise, the algorithm transforms the data into a multidimensional metric space, which makes it similar to locality-sensitive hashing. The task lies in th...

Bibliographic Details
Main author: Sterz, Christoph
Language: eng
Published: 2014
Subjects:
Online access: http://cds.cern.ch/record/1752207
_version_ 1780943176793587712
author Sterz, Christoph
author_facet Sterz, Christoph
author_sort Sterz, Christoph
collection CERN
description We analyze and design an algorithm for clustering large sets of authors in bibliographies. Rather than applying a distance function to compare records pairwise, the algorithm transforms the data into a multidimensional metric space, which makes it similar to locality-sensitive hashing. The task lies in the field of record linkage. The algorithm was designed and run on the data of the CERN Document Server (CDS), which consists of more than 1.7 million metadata entries, and is part of the digital asset management software Invenio. Meant as a prototype, the algorithm performs efficiently, clustering all authors on CDS in under 30 minutes. We discuss extensions to improve the recall rate, which remains inferior to that of the currently used clustering approach.
id cern-1752207
institution European Organization for Nuclear Research
language eng
publishDate 2014
record_format invenio
spelling cern-1752207 2019-09-30T06:29:59Z http://cds.cern.ch/record/1752207 eng Sterz, Christoph Author Clustering on Large Bibliographies Computing and Computers Information Transfer and Management We analyze and design an algorithm for clustering large sets of authors in bibliographies. Rather than applying a distance function to compare records pairwise, the algorithm transforms the data into a multidimensional metric space, which makes it similar to locality-sensitive hashing. The task lies in the field of record linkage. The algorithm was designed and run on the data of the CERN Document Server (CDS), which consists of more than 1.7 million metadata entries, and is part of the digital asset management software Invenio. Meant as a prototype, the algorithm performs efficiently, clustering all authors on CDS in under 30 minutes. We discuss extensions to improve the recall rate, which remains inferior to that of the currently used clustering approach. CERN-STUDENTS-Note-2014-128 oai:cds.cern.ch:1752207 2014-08-28
spellingShingle Computing and Computers
Information Transfer and Management
Sterz, Christoph
Author Clustering on Large Bibliographies
title Author Clustering on Large Bibliographies
title_full Author Clustering on Large Bibliographies
title_fullStr Author Clustering on Large Bibliographies
title_full_unstemmed Author Clustering on Large Bibliographies
title_short Author Clustering on Large Bibliographies
title_sort author clustering on large bibliographies
topic Computing and Computers
Information Transfer and Management
url http://cds.cern.ch/record/1752207
work_keys_str_mv AT sterzchristoph authorclusteringonlargebibliographies