
The Effect of Content-Equivalent Near-Duplicates on the Evaluation of Search Engines

Current best practices for the evaluation of search engines do not take duplicate documents into account. Depending on their prevalence, not discounting duplicates during evaluation artificially inflates performance scores and penalizes those whose search systems diligently filter them. Althoug...


Bibliographic Details
Main authors: Fröbe, Maik, Bittner, Jan Philipp, Potthast, Martin, Hagen, Matthias
Format: Online Article Text
Language: English
Published: 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7148013/
http://dx.doi.org/10.1007/978-3-030-45442-5_2
_version_ 1783520511431540736
author Fröbe, Maik
Bittner, Jan Philipp
Potthast, Martin
Hagen, Matthias
author_facet Fröbe, Maik
Bittner, Jan Philipp
Potthast, Martin
Hagen, Matthias
author_sort Fröbe, Maik
collection PubMed
description Current best practices for the evaluation of search engines do not take duplicate documents into account. Depending on their prevalence, not discounting duplicates during evaluation artificially inflates performance scores and penalizes those whose search systems diligently filter them. Although these negative effects were demonstrated long ago by Bernstein and Zobel [4], we find that this has failed to move the community. In this paper, we reproduce the aforementioned study and extend it to incorporate all TREC Terabyte, Web, and Core tracks. The worst-case penalties for having filtered duplicates in any of these tracks were losses of between 8 and 53 ranks.
format Online
Article
Text
id pubmed-7148013
institution National Center for Biotechnology Information
language English
publishDate 2020
record_format MEDLINE/PubMed
spelling pubmed-7148013 2020-04-13 The Effect of Content-Equivalent Near-Duplicates on the Evaluation of Search Engines Fröbe, Maik Bittner, Jan Philipp Potthast, Martin Hagen, Matthias Advances in Information Retrieval Article Current best practices for the evaluation of search engines do not take duplicate documents into account. Depending on their prevalence, not discounting duplicates during evaluation artificially inflates performance scores and penalizes those whose search systems diligently filter them. Although these negative effects were demonstrated long ago by Bernstein and Zobel [4], we find that this has failed to move the community. In this paper, we reproduce the aforementioned study and extend it to incorporate all TREC Terabyte, Web, and Core tracks. The worst-case penalties for having filtered duplicates in any of these tracks were losses of between 8 and 53 ranks. 2020-03-24 /pmc/articles/PMC7148013/ http://dx.doi.org/10.1007/978-3-030-45442-5_2 Text en © Springer Nature Switzerland AG 2020 This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic.
spellingShingle Article
Fröbe, Maik
Bittner, Jan Philipp
Potthast, Martin
Hagen, Matthias
The Effect of Content-Equivalent Near-Duplicates on the Evaluation of Search Engines
title The Effect of Content-Equivalent Near-Duplicates on the Evaluation of Search Engines
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7148013/
http://dx.doi.org/10.1007/978-3-030-45442-5_2