
Multi-field query expansion is effective for biomedical dataset retrieval

In the context of the bioCADDIE challenge addressing information retrieval of biomedical datasets, we propose a method for retrieval of biomedical datasets with heterogeneous schemas through query reformulation. In particular, the proposed method transforms the initial query into a multi-field query...


Bibliographic Details
Main Authors: Bouadjenek, Mohamed Reda; Verspoor, Karin
Format: Online Article Text
Language: English
Published: Oxford University Press 2017
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5737205/
https://www.ncbi.nlm.nih.gov/pubmed/29220457
http://dx.doi.org/10.1093/database/bax062
_version_ 1783287484285714432
author Bouadjenek, Mohamed Reda
Verspoor, Karin
author_facet Bouadjenek, Mohamed Reda
Verspoor, Karin
author_sort Bouadjenek, Mohamed Reda
collection PubMed
description In the context of the bioCADDIE challenge addressing information retrieval of biomedical datasets, we propose a method for retrieval of biomedical datasets with heterogeneous schemas through query reformulation. In particular, the proposed method transforms the initial query into a multi-field query that is then enriched with terms that are likely to occur in the relevant datasets. We compare and evaluate two query expansion strategies, one based on the Rocchio method and another based on a biomedical lexicon. We then perform a comprehensive comparative evaluation of our method on the bioCADDIE dataset collection for biomedical retrieval. We demonstrate the effectiveness of our multi-field query method compared to two baselines, with MAP improved from 0.2171 and 0.2669 to 0.2996. We also show the benefits of query expansion, where the Rocchio expansion method improves the MAP for our two baselines from 0.2171 and 0.2669 to 0.335. We show that the Rocchio query expansion method slightly outperforms the one based on the biomedical lexicon as a source of terms, with an improvement of roughly 3% for MAP. However, the query expansion method based on the biomedical lexicon is much less resource intensive, since it does not require computation of any relevance feedback set or any initial execution of the query. Hence, in terms of the trade-off between efficiency, execution time and retrieval accuracy, we argue that the query expansion method based on the biomedical lexicon offers the best performance for a prototype biomedical data search engine intended to be used at a large scale. In the official bioCADDIE challenge results, although our approach is ranked seventh in terms of the infNDCG evaluation metric, it ranks second in terms of P@10 and NDCG. Hence, the method proposed here provides overall good retrieval performance in relation to the approaches of other competitors. Consequently, the observations made in this paper should benefit the development of a Data Discovery Index prototype or the improvement of the existing one.
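As an illustration of the Rocchio-style expansion mentioned in the description, the following is a minimal sketch and not the authors' implementation, which this record does not detail. It assumes pseudo-relevance feedback over pre-tokenized documents, drops the negative (non-relevant) term of the classic Rocchio formula, and uses hypothetical names (`rocchio_expand`, `alpha`, `beta`, `n_expansion`) and illustrative default weights.

```python
from collections import Counter


def rocchio_expand(query_terms, feedback_docs, alpha=1.0, beta=0.75, n_expansion=3):
    """Expand a query with the highest-weighted terms from feedback documents.

    Hypothetical helper: weights follow alpha * query + beta * centroid(feedback),
    i.e. Rocchio without the non-relevant component.
    """
    weights = Counter()

    # Contribution of the original query (raw term frequency scaled by alpha).
    for term, tf in Counter(query_terms).items():
        weights[term] += alpha * tf

    # Contribution of the feedback centroid (mean term frequency scaled by beta).
    for doc in feedback_docs:
        for term, tf in Counter(doc).items():
            weights[term] += beta * tf / len(feedback_docs)

    # Candidate expansion terms are those not already in the query,
    # ordered by descending weight, then alphabetically for determinism.
    original = set(query_terms)
    candidates = [(t, w) for t, w in weights.items() if t not in original]
    candidates.sort(key=lambda tw: (-tw[1], tw[0]))

    expansion = [t for t, _ in candidates[:n_expansion]]
    return list(query_terms) + expansion, dict(weights)


if __name__ == "__main__":
    # Toy feedback set standing in for the top-ranked results of an initial run.
    docs = [
        ["cancer", "tumor", "gene", "expression"],
        ["tumor", "expression", "mutation"],
    ]
    expanded, _ = rocchio_expand(["cancer", "gene"], docs, n_expansion=2)
    print(expanded)
```

The need for `feedback_docs` is what makes this strategy costlier than a lexicon lookup: the query has to be executed once before it can be expanded.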
format Online
Article
Text
id pubmed-5737205
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-5737205 2018-01-08 Multi-field query expansion is effective for biomedical dataset retrieval Bouadjenek, Mohamed Reda Verspoor, Karin Database (Oxford) Original Article Oxford University Press 2017-09-07 /pmc/articles/PMC5737205/ /pubmed/29220457 http://dx.doi.org/10.1093/database/bax062 Text en © The Author(s) 2017. Published by Oxford University Press. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Original Article
Bouadjenek, Mohamed Reda
Verspoor, Karin
Multi-field query expansion is effective for biomedical dataset retrieval
title Multi-field query expansion is effective for biomedical dataset retrieval
title_full Multi-field query expansion is effective for biomedical dataset retrieval
title_fullStr Multi-field query expansion is effective for biomedical dataset retrieval
title_full_unstemmed Multi-field query expansion is effective for biomedical dataset retrieval
title_short Multi-field query expansion is effective for biomedical dataset retrieval
title_sort multi-field query expansion is effective for biomedical dataset retrieval
topic Original Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5737205/
https://www.ncbi.nlm.nih.gov/pubmed/29220457
http://dx.doi.org/10.1093/database/bax062
work_keys_str_mv AT bouadjenekmohamedreda multifieldqueryexpansioniseffectiveforbiomedicaldatasetretrieval
AT verspoorkarin multifieldqueryexpansioniseffectiveforbiomedicaldatasetretrieval