Finding relevant biomedical datasets: the UC San Diego solution for the bioCADDIE Retrieval Challenge
The number and diversity of biomedical datasets grew rapidly in the last decade. A large number of datasets are stored in various repositories, with different formats. Existing dataset retrieval systems lack the capability of cross-repository search. As a result, users spend time searching datasets...
Main authors: | Wei Wei, Zhanglong Ji, Yupeng He, Kai Zhang, Yuanchi Ha, Qi Li, Lucila Ohno-Machado |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2018 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5861401/ https://www.ncbi.nlm.nih.gov/pubmed/29688374 http://dx.doi.org/10.1093/database/bay017 |
_version_ | 1783308085676212224 |
---|---|
author | Wei, Wei; Ji, Zhanglong; He, Yupeng; Zhang, Kai; Ha, Yuanchi; Li, Qi; Ohno-Machado, Lucila |
author_facet | Wei, Wei; Ji, Zhanglong; He, Yupeng; Zhang, Kai; Ha, Yuanchi; Li, Qi; Ohno-Machado, Lucila |
author_sort | Wei, Wei |
collection | PubMed |
description | The number and diversity of biomedical datasets grew rapidly in the last decade. A large number of datasets are stored in various repositories, with different formats. Existing dataset retrieval systems lack the capability of cross-repository search. As a result, users spend time searching datasets in known repositories, and they typically do not find new repositories. The biomedical and healthcare data discovery index ecosystem (bioCADDIE) team organized a challenge to solicit new indexing and searching strategies for retrieving biomedical datasets across repositories. We describe the work of one team that built a retrieval pipeline and examined its performance. The pipeline used online resources to supplement dataset metadata, automatically generated queries from users’ free-text questions, produced high-quality retrieval results and achieved the highest inferred Normalized Discounted Cumulative Gain among competitors. The results showed that it is a promising solution for cross-database, cross-domain and cross-repository biomedical dataset retrieval. Database URL: https://github.com/w2wei/dataset_retrieval_pipeline |
format | Online Article Text |
id | pubmed-5861401 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2018 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-5861401 2018-03-28 Finding relevant biomedical datasets: the UC San Diego solution for the bioCADDIE Retrieval Challenge Wei, Wei Ji, Zhanglong He, Yupeng Zhang, Kai Ha, Yuanchi Li, Qi Ohno-Machado, Lucila Database (Oxford) Original Article The number and diversity of biomedical datasets grew rapidly in the last decade. A large number of datasets are stored in various repositories, with different formats. Existing dataset retrieval systems lack the capability of cross-repository search. As a result, users spend time searching datasets in known repositories, and they typically do not find new repositories. The biomedical and healthcare data discovery index ecosystem (bioCADDIE) team organized a challenge to solicit new indexing and searching strategies for retrieving biomedical datasets across repositories. We describe the work of one team that built a retrieval pipeline and examined its performance. The pipeline used online resources to supplement dataset metadata, automatically generated queries from users’ free-text questions, produced high-quality retrieval results and achieved the highest inferred Normalized Discounted Cumulative Gain among competitors. The results showed that it is a promising solution for cross-database, cross-domain and cross-repository biomedical dataset retrieval. Database URL: https://github.com/w2wei/dataset_retrieval_pipeline Oxford University Press 2018-03-16 /pmc/articles/PMC5861401/ /pubmed/29688374 http://dx.doi.org/10.1093/database/bay017 Text en © The Author(s) 2018. Published by Oxford University Press. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Original Article Wei, Wei Ji, Zhanglong He, Yupeng Zhang, Kai Ha, Yuanchi Li, Qi Ohno-Machado, Lucila Finding relevant biomedical datasets: the UC San Diego solution for the bioCADDIE Retrieval Challenge |
title | Finding relevant biomedical datasets: the UC San Diego solution for the bioCADDIE Retrieval Challenge |
topic | Original Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5861401/ https://www.ncbi.nlm.nih.gov/pubmed/29688374 http://dx.doi.org/10.1093/database/bay017 |