Author Clustering on Large Bibliographies

Bibliographic Details
Main author: Sterz, Christoph
Language: eng
Published: 2014
Subjects:
Online access: http://cds.cern.ch/record/1752207
Description
Summary: We analyze and design an algorithm for clustering large sets of authors in bibliographies. Rather than using a distance function for pairwise comparison, the algorithm transforms the data into a multidimensional metric space, which makes it similar to locality-sensitive hashing. The task lies in the field of record linkage. The algorithm was designed and run on the data of the CERN Document Server (CDS), which consists of more than 1.7 million metadata entries, and it is part of the digital asset management software Invenio. Meant as a prototype, the algorithm performs efficiently, clustering all authors on CDS in under 30 minutes. We discuss extensions that improve the recall rate, which still remains inferior to that of the currently used clustering approach.
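
The general idea of bucketing records by their position in a vector space, instead of comparing every pair, can be illustrated with a minimal Python sketch. This is not the algorithm from the thesis, whose details are not given in this record. It assumes a hashed character-bigram embedding and random-hyperplane signatures, a common locality-sensitive hashing scheme, and all names in it (`normalize`, `embed`, `make_planes`, `signature`, `cluster`, `DIM`, `N_PLANES`) are hypothetical.

```python
import random
import zlib
from collections import defaultdict

DIM = 64        # dimensionality of the vector space (illustrative assumption)
N_PLANES = 16   # number of signature bits per author (illustrative assumption)


def normalize(name: str) -> str:
    """Lowercase the name and keep only letters separated by single spaces."""
    cleaned = "".join(c if c.isalpha() else " " for c in name.lower())
    return " ".join(cleaned.split())


def embed(name: str) -> list[float]:
    """Map a name to a point in DIM-dimensional space via hashed character bigrams."""
    vec = [0.0] * DIM
    text = normalize(name)
    for i in range(len(text) - 1):
        bigram = text[i:i + 2]
        # crc32 is deterministic across runs, unlike the built-in hash() for str.
        vec[zlib.crc32(bigram.encode("utf-8")) % DIM] += 1.0
    return vec


def make_planes(seed: int = 0) -> list[list[float]]:
    """Draw N_PLANES random hyperplanes (as normal vectors) from a seeded generator."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(N_PLANES)]


def signature(vec: list[float], planes: list[list[float]]) -> tuple[int, ...]:
    """One bit per hyperplane: which side of the plane the vector lies on."""
    return tuple(
        int(sum(p * v for p, v in zip(plane, vec)) >= 0.0) for plane in planes
    )


def cluster(names: list[str], seed: int = 0) -> dict[tuple[int, ...], list[str]]:
    """Group names by signature in a single pass, with no pairwise comparisons."""
    planes = make_planes(seed)
    buckets = defaultdict(list)
    for name in names:
        buckets[signature(embed(name), planes)].append(name)
    return dict(buckets)


if __name__ == "__main__":
    sample = ["Doe, Jane", "DOE, JANE", "doe jane", "Roe, Richard"]
    for sig, members in cluster(sample).items():
        print(members)
```

In this sketch, spellings that normalize to the same string always share a bucket, while near-identical spellings only share one with some probability. That trade-off is one reason recall can lag behind an approach based on pairwise comparison.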