The Effect of Content-Equivalent Near-Duplicates on the Evaluation of Search Engines
Current best practices for the evaluation of search engines do not take into account duplicate documents. Dependent on their prevalence, not discounting duplicates during evaluation artificially inflates performance scores, and, it penalizes those whose search systems diligently filter them. Althoug...
Main Authors: | Fröbe, Maik; Bittner, Jan Philipp; Potthast, Martin; Hagen, Matthias |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | 2020 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7148013/ http://dx.doi.org/10.1007/978-3-030-45442-5_2 |
_version_ | 1783520511431540736 |
---|---|
author | Fröbe, Maik Bittner, Jan Philipp Potthast, Martin Hagen, Matthias |
author_facet | Fröbe, Maik Bittner, Jan Philipp Potthast, Martin Hagen, Matthias |
author_sort | Fröbe, Maik |
collection | PubMed |
description | Current best practices for the evaluation of search engines do not take into account duplicate documents. Dependent on their prevalence, not discounting duplicates during evaluation artificially inflates performance scores, and, it penalizes those whose search systems diligently filter them. Although these negative effects have already been demonstrated a long time ago by Bernstein and Zobel [4], we find that this has failed to move the community. In this paper, we reproduce the aforementioned study and extend it to incorporate all TREC Terabyte, Web, and Core tracks. The worst-case penalty of having filtered duplicates in any of these tracks were losses between 8 and 53 ranks. |
format | Online Article Text |
id | pubmed-7148013 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
record_format | MEDLINE/PubMed |
spelling | pubmed-71480132020-04-13 The Effect of Content-Equivalent Near-Duplicates on the Evaluation of Search Engines Fröbe, Maik Bittner, Jan Philipp Potthast, Martin Hagen, Matthias Advances in Information Retrieval Article Current best practices for the evaluation of search engines do not take into account duplicate documents. Dependent on their prevalence, not discounting duplicates during evaluation artificially inflates performance scores, and, it penalizes those whose search systems diligently filter them. Although these negative effects have already been demonstrated a long time ago by Bernstein and Zobel [4], we find that this has failed to move the community. In this paper, we reproduce the aforementioned study and extend it to incorporate all TREC Terabyte, Web, and Core tracks. The worst-case penalty of having filtered duplicates in any of these tracks were losses between 8 and 53 ranks. 2020-03-24 /pmc/articles/PMC7148013/ http://dx.doi.org/10.1007/978-3-030-45442-5_2 Text en © Springer Nature Switzerland AG 2020 This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic. |
spellingShingle | Article Fröbe, Maik Bittner, Jan Philipp Potthast, Martin Hagen, Matthias The Effect of Content-Equivalent Near-Duplicates on the Evaluation of Search Engines |
title | The Effect of Content-Equivalent Near-Duplicates on the Evaluation of Search Engines |
title_full | The Effect of Content-Equivalent Near-Duplicates on the Evaluation of Search Engines |
title_fullStr | The Effect of Content-Equivalent Near-Duplicates on the Evaluation of Search Engines |
title_full_unstemmed | The Effect of Content-Equivalent Near-Duplicates on the Evaluation of Search Engines |
title_short | The Effect of Content-Equivalent Near-Duplicates on the Evaluation of Search Engines |
title_sort | effect of content-equivalent near-duplicates on the evaluation of search engines |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7148013/ http://dx.doi.org/10.1007/978-3-030-45442-5_2 |