Author Clustering on Large Bibliographies
Main author: | |
---|---|
Language: | eng |
Published: | 2014 |
Subjects: | |
Online access: | http://cds.cern.ch/record/1752207 |
Summary: | We analyze and design an algorithm for clustering large sets of authors in bibliographies. Rather than using a distance function for pairwise comparison, the algorithm transforms the data into a multidimensional metric space, which makes it similar to locality-sensitive hashing. The task lies in the field of record linkage. The algorithm was designed and run on the data of the CERN Document Server (CDS), which consists of more than 1.7 million metadata entries, and is part of the digital asset management software Invenio. Although meant as a prototype, the algorithm performs efficiently, clustering all authors on CDS in under 30 minutes. We will discuss extensions that improve the recall rate, which still remains inferior to that of the clustering approach currently in use. |
---|
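
The summary compares the approach to locality-sensitive hashing, where records are hashed into buckets so that only records sharing a bucket are treated as candidates for the same cluster, and no all-pairs distance computation is needed. The following is a minimal sketch of that general idea using MinHash with banding. It is not the algorithm of the thesis: the function names (`shingles`, `minhash`, `cluster_authors`), the use of character trigrams, and the parameters `num_hashes` and `bands` are all assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict


def shingles(name, n=3):
    """Character n-grams of a case-folded, whitespace-normalized name."""
    s = " ".join(name.casefold().split())
    return {s[i:i + n] for i in range(max(1, len(s) - n + 1))}


def minhash(tokens, num_hashes=16):
    """MinHash signature: per seeded hash function, the minimum hash over all tokens."""
    return tuple(
        min(
            int.from_bytes(hashlib.md5(f"{seed}:{t}".encode("utf-8")).digest(), "big")
            for t in tokens
        )
        for seed in range(num_hashes)
    )


def cluster_authors(names, num_hashes=16, bands=8):
    """Group names that share at least one LSH band bucket (illustrative only)."""
    rows = num_hashes // bands
    parent = list(range(len(names)))  # union-find forest over name indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    buckets = {}  # (band index, band values) -> index of first name seen there
    for idx, name in enumerate(names):
        sig = minhash(shingles(name), num_hashes)
        for b in range(bands):
            key = (b, sig[b * rows:(b + 1) * rows])
            if key in buckets:
                parent[find(idx)] = find(buckets[key])  # merge with earlier name
            else:
                buckets[key] = idx

    clusters = defaultdict(list)
    for idx, name in enumerate(names):
        clusters[find(idx)].append(name)
    return list(clusters.values())
```

Calling `cluster_authors(["Smith, John", "smith,  john", "Müller, Anna"])` places the two spellings of the first name in one cluster, because they normalize to the same string and therefore receive identical signatures. Each name is hashed once, so the work grows roughly linearly with the number of names. Fewer rows per band make near-duplicates more likely to collide, which raises recall at the cost of precision.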