Evaluating ranking methods on heterogeneous digital library collections

Bibliographic Details
Main author: Canévet, Olivier
Language: eng
Published: 2012
Subjects:
Online access: http://cds.cern.ch/record/1479357
Description
Summary: As part of its research in particle physics, CERN has been developing its own web-based software, /Invenio/, to run the digital library of all documents related to CERN and fundamental physics. The documents (articles, photos, news, theses, ...) can be retrieved through a search engine. The results matching the user's query can be displayed in several ways: sorted by latest first, by author, or by title, or ranked by word similarity. The purpose of this project is to study and implement a new ranking method in Invenio: distributed-ranking (D-Rank). This method aggregates several ranking scores coming from different ranking methods into a single new score. In addition to query-related scores such as word similarity, the goal of the work is to take into account non-query-related scores such as citations, journal impact factor and, in particular, scores related to how frequently a document is accessed in the database. The idea is that, of two equally query-relevant documents, the one that has been downloaded more often, for instance, should be displayed ahead of the other. The approach we studied uses /logistic regression/ as the aggregation process, which combines the scores to be aggregated through a weighted sum. Usually, optimal weights can be computed from the data. In our case, we used user feedback: search activity was recorded for six months (queries made, documents displayed, documents downloaded, ...) and we split this data set in two, one part to estimate the optimal coefficients and the other to test them. The test consisted of re-ranking the users' queries with the optimal coefficients. We then compared the results with the initial ranking to see whether the documents that had been clicked at the time were ranked higher. The optimal coefficients obtained are coherent in the sense that negative attributes of a document received a negative coefficient in the logistic formula. However, the relative orders of magnitude of the logistic coefficients were unexpected: the weight of the query-relevance score was much lower than the other weights. Re-ranking the queries showed some improvement for records that had already been downloaded from the database, which were ranked higher.
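
The aggregation step described in the summary can be illustrated with a minimal Python sketch. It is not the Invenio or D-Rank implementation: the function names, the score names (`word_similarity`, `downloads`, `citations`), the weights and the sample documents are all hypothetical, chosen only to show how a weighted sum passed through the logistic function can order two equally query-relevant documents by download activity.

```python
import math


def logistic(z):
    """Standard logistic function, mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def aggregate_score(scores, weights, bias=0.0):
    """Combine per-method scores into one value.

    The scores are merged through a weighted sum, which is then passed
    through the logistic function. `weights` and `bias` are assumed to
    have been estimated beforehand (for example from recorded user
    feedback); here they are illustrative values only.
    """
    z = bias + sum(weights[name] * value for name, value in scores.items())
    return logistic(z)


def rerank(documents, weights, bias=0.0):
    """Return the documents sorted by aggregated score, best first."""
    return sorted(
        documents,
        key=lambda doc: aggregate_score(doc["scores"], weights, bias),
        reverse=True,
    )


# Hypothetical weights: NOT the coefficients obtained in the thesis.
weights = {"word_similarity": 1.0, "downloads": 0.5, "citations": 0.3}

# Hypothetical documents with scores already normalised to [0, 1].
# A and B are equally query-relevant; B has been downloaded more often.
documents = [
    {"id": "A", "scores": {"word_similarity": 0.8, "downloads": 0.1, "citations": 0.2}},
    {"id": "B", "scores": {"word_similarity": 0.8, "downloads": 0.9, "citations": 0.2}},
    {"id": "C", "scores": {"word_similarity": 0.3, "downloads": 0.5, "citations": 0.0}},
]

ranked_ids = [doc["id"] for doc in rerank(documents, weights)]
print(ranked_ids)
```

With these made-up values, document B is placed ahead of document A although both have the same word-similarity score, which is the behaviour the summary describes for equally query-relevant documents. Because the logistic function is monotonic, sorting by the weighted sum alone would give the same order; the logistic form matters mainly when the weights are estimated from click data.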