Which BM25 Do You Mean? A Large-Scale Reproducibility Study of Scoring Variants

When researchers speak of BM25, it is not entirely clear which variant they mean, since many tweaks to Robertson et al.’s original formulation have been proposed. When practitioners speak of BM25, they most likely refer to the implementation in the Lucene open-source search library. Does this ambigu...

Bibliographic Details
Main authors: Kamphuis, Chris; de Vries, Arjen P.; Boytsov, Leonid; Lin, Jimmy
Format: Online Article Text
Language: English
Published: 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7148026/
http://dx.doi.org/10.1007/978-3-030-45442-5_4
author Kamphuis, Chris
de Vries, Arjen P.
Boytsov, Leonid
Lin, Jimmy
collection PubMed
description When researchers speak of BM25, it is not entirely clear which variant they mean, since many tweaks to Robertson et al.’s original formulation have been proposed. When practitioners speak of BM25, they most likely refer to the implementation in the Lucene open-source search library. Does this ambiguity “matter”? We attempt to answer this question with a large-scale reproducibility study of BM25, considering eight variants. Experiments on three newswire collections show that there are no significant effectiveness differences between them, including Lucene’s often maligned approximation of document length. As an added benefit, our empirical approach takes advantage of databases for rapid IR prototyping, which validates both the feasibility and methodological advantages claimed in previous work.
format Online
Article
Text
id pubmed-7148026
institution National Center for Biotechnology Information
language English
publishDate 2020
record_format MEDLINE/PubMed
spelling pubmed-7148026 2020-04-13 Which BM25 Do You Mean? A Large-Scale Reproducibility Study of Scoring Variants. Kamphuis, Chris; de Vries, Arjen P.; Boytsov, Leonid; Lin, Jimmy. Advances in Information Retrieval. Article. 2020-03-24 /pmc/articles/PMC7148026/ http://dx.doi.org/10.1007/978-3-030-45442-5_4 Text en © Springer Nature Switzerland AG 2020 This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic.
title Which BM25 Do You Mean? A Large-Scale Reproducibility Study of Scoring Variants
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7148026/
http://dx.doi.org/10.1007/978-3-030-45442-5_4