Probabilistic and machine learning-based retrieval approaches for biomedical dataset retrieval

The bioCADDIE dataset retrieval challenge brought together different approaches to retrieval of biomedical datasets relevant to a user’s query, expressed as a text description of a needed dataset. We describe experiments in applying a data-driven, machine learning-based approach to biomedical datase...

Bibliographic Details
Main Authors: Karisani, Payam; Qin, Zhaohui S; Agichtein, Eugene
Format: Online Article Text
Language: English
Published: Oxford University Press 2018
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5887275/
https://www.ncbi.nlm.nih.gov/pubmed/29688379
http://dx.doi.org/10.1093/database/bax104
author Karisani, Payam
Qin, Zhaohui S
Agichtein, Eugene
author_sort Karisani, Payam
collection PubMed
description The bioCADDIE dataset retrieval challenge brought together different approaches to retrieval of biomedical datasets relevant to a user’s query, expressed as a text description of a needed dataset. We describe experiments in applying a data-driven, machine learning-based approach to biomedical dataset retrieval as part of this challenge. We report on a series of experiments carried out to evaluate the performance of both probabilistic and machine learning-driven techniques from information retrieval, as applied to this challenge. Our experiments with probabilistic information retrieval methods, such as query term weight optimization, automatic query expansion and simulated user relevance feedback, demonstrate that automatically boosting the weights of important keywords in a verbose query is more effective than other methods. We also show that although there is a rich space of potential representations and features available in this domain, machine learning-based re-ranking models are not able to improve on probabilistic information retrieval techniques with the currently available training data. The models and algorithms presented in this paper can serve as a viable implementation of a search engine to provide access to biomedical datasets. The retrieval performance is expected to be further improved by using additional training data that is created by expert annotation, or gathered through usage logs, clicks and other processes during natural operation of the system. Database URL: https://github.com/emory-irlab/biocaddie
format Online
Article
Text
id pubmed-5887275
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-5887275 2018-04-11 Probabilistic and machine learning-based retrieval approaches for biomedical dataset retrieval Karisani, Payam Qin, Zhaohui S Agichtein, Eugene Database (Oxford) Original Article Oxford University Press 2018-03-28 /pmc/articles/PMC5887275/ /pubmed/29688379 http://dx.doi.org/10.1093/database/bax104 Text en © The Author(s) 2018. Published by Oxford University Press.
http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Probabilistic and machine learning-based retrieval approaches for biomedical dataset retrieval
topic Original Article