A systematic approach to normalization in probabilistic models

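Every information retrieval (IR) model embeds in its scoring function a form of term frequency (TF) quantification. The contribution of the term frequency is determined by the properties of the function of the chosen TF quantification, and by its TF normalization. The first defines how independent the occurrences of multiple terms are, while the second acts on mitigating the a priori probability of having a high term frequency in a document (estimation usually based on the document length). New test collections, coming from different domains (e.g. medical, legal), give evidence that not only document length, but in addition, verboseness of documents should be explicitly considered. Therefore we propose and investigate a systematic combination of document verboseness and length. To theoretically justify the combination, we show the duality between document verboseness and length. In addition, we investigate the duality between verboseness and other components of IR models. We test these new TF normalizations on four suitable test collections. We do this on a well defined spectrum of TF quantifications. Finally, based on the theoretical and experimental observations, we show how the two components of this new normalization, document verboseness and length, interact with each other. Our experiments demonstrate that the new models never underperform existing models, while sometimes introducing statistically significantly better results, at no additional computational cost.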

Bibliographic Details
Main Authors:	Lipani, Aldo; Roelleke, Thomas; Lupu, Mihai; Hanbury, Allan
Format:	Online Article Text
Language:	English
Published:	Springer Netherlands 2018
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6208902/ https://www.ncbi.nlm.nih.gov/pubmed/30416369 http://dx.doi.org/10.1007/s10791-018-9334-1

author	Lipani, Aldo; Roelleke, Thomas; Lupu, Mihai; Hanbury, Allan
collection	PubMed
description	Every information retrieval (IR) model embeds in its scoring function a form of term frequency (TF) quantification. The contribution of the term frequency is determined by the properties of the function of the chosen TF quantification, and by its TF normalization. The first defines how independent the occurrences of multiple terms are, while the second acts on mitigating the a priori probability of having a high term frequency in a document (estimation usually based on the document length). New test collections, coming from different domains (e.g. medical, legal), give evidence that not only document length, but in addition, verboseness of documents should be explicitly considered. Therefore we propose and investigate a systematic combination of document verboseness and length. To theoretically justify the combination, we show the duality between document verboseness and length. In addition, we investigate the duality between verboseness and other components of IR models. We test these new TF normalizations on four suitable test collections. We do this on a well defined spectrum of TF quantifications. Finally, based on the theoretical and experimental observations, we show how the two components of this new normalization, document verboseness and length, interact with each other. Our experiments demonstrate that the new models never underperform existing models, while sometimes introducing statistically significantly better results, at no additional computational cost.
format	Online Article Text
id	pubmed-6208902
institution	National Center for Biotechnology Information
language	English
publishDate	2018
publisher	Springer Netherlands
record_format	MEDLINE/PubMed
spelling	pubmed-6208902 2018-11-09 A systematic approach to normalization in probabilistic models Lipani, Aldo; Roelleke, Thomas; Lupu, Mihai; Hanbury, Allan Inf Retr Boston Article
Springer Netherlands 2018-06-30 2018 /pmc/articles/PMC6208902/ /pubmed/30416369 http://dx.doi.org/10.1007/s10791-018-9334-1 Text en © The Author(s) 2018 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.
title	A systematic approach to normalization in probabilistic models
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6208902/ https://www.ncbi.nlm.nih.gov/pubmed/30416369 http://dx.doi.org/10.1007/s10791-018-9334-1
