Finding relevant biomedical datasets: the UC San Diego solution for the bioCADDIE Retrieval Challenge

The number and diversity of biomedical datasets grew rapidly in the last decade. A large number of datasets are stored in various repositories, with different formats. Existing dataset retrieval systems lack the capability of cross-repository search. As a result, users spend time searching datasets...

Bibliographic Details
Main authors: Wei, Wei; Ji, Zhanglong; He, Yupeng; Zhang, Kai; Ha, Yuanchi; Li, Qi; Ohno-Machado, Lucila
Format: Online Article Text
Language: English
Published: Oxford University Press 2018
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5861401/
https://www.ncbi.nlm.nih.gov/pubmed/29688374
http://dx.doi.org/10.1093/database/bay017
_version_ 1783308085676212224
author Wei, Wei
Ji, Zhanglong
He, Yupeng
Zhang, Kai
Ha, Yuanchi
Li, Qi
Ohno-Machado, Lucila
author_facet Wei, Wei
Ji, Zhanglong
He, Yupeng
Zhang, Kai
Ha, Yuanchi
Li, Qi
Ohno-Machado, Lucila
author_sort Wei, Wei
collection PubMed
description The number and diversity of biomedical datasets grew rapidly in the last decade. A large number of datasets are stored in various repositories, with different formats. Existing dataset retrieval systems lack the capability of cross-repository search. As a result, users spend time searching datasets in known repositories, and they typically do not find new repositories. The biomedical and healthcare data discovery index ecosystem (bioCADDIE) team organized a challenge to solicit new indexing and searching strategies for retrieving biomedical datasets across repositories. We describe the work of one team that built a retrieval pipeline and examined its performance. The pipeline used online resources to supplement dataset metadata, automatically generated queries from users’ free-text questions, produced high-quality retrieval results and achieved the highest inferred Normalized Discounted Cumulative Gain among competitors. The results showed that it is a promising solution for cross-database, cross-domain and cross-repository biomedical dataset retrieval. Database URL: https://github.com/w2wei/dataset_retrieval_pipeline
format Online
Article
Text
id pubmed-5861401
institution National Center for Biotechnology Information
language English
publishDate 2018
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-58614012018-03-28 Finding relevant biomedical datasets: the UC San Diego solution for the bioCADDIE Retrieval Challenge Wei, Wei Ji, Zhanglong He, Yupeng Zhang, Kai Ha, Yuanchi Li, Qi Ohno-Machado, Lucila Database (Oxford) Original Article The number and diversity of biomedical datasets grew rapidly in the last decade. A large number of datasets are stored in various repositories, with different formats. Existing dataset retrieval systems lack the capability of cross-repository search. As a result, users spend time searching datasets in known repositories, and they typically do not find new repositories. The biomedical and healthcare data discovery index ecosystem (bioCADDIE) team organized a challenge to solicit new indexing and searching strategies for retrieving biomedical datasets across repositories. We describe the work of one team that built a retrieval pipeline and examined its performance. The pipeline used online resources to supplement dataset metadata, automatically generated queries from users’ free-text questions, produced high-quality retrieval results and achieved the highest inferred Normalized Discounted Cumulative Gain among competitors. The results showed that it is a promising solution for cross-database, cross-domain and cross-repository biomedical dataset retrieval. Database URL: https://github.com/w2wei/dataset_retrieval_pipeline Oxford University Press 2018-03-16 /pmc/articles/PMC5861401/ /pubmed/29688374 http://dx.doi.org/10.1093/database/bay017 Text en © The Author(s) 2018. Published by Oxford University Press. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Original Article
Wei, Wei
Ji, Zhanglong
He, Yupeng
Zhang, Kai
Ha, Yuanchi
Li, Qi
Ohno-Machado, Lucila
Finding relevant biomedical datasets: the UC San Diego solution for the bioCADDIE Retrieval Challenge
title Finding relevant biomedical datasets: the UC San Diego solution for the bioCADDIE Retrieval Challenge
title_full Finding relevant biomedical datasets: the UC San Diego solution for the bioCADDIE Retrieval Challenge
title_fullStr Finding relevant biomedical datasets: the UC San Diego solution for the bioCADDIE Retrieval Challenge
title_full_unstemmed Finding relevant biomedical datasets: the UC San Diego solution for the bioCADDIE Retrieval Challenge
title_short Finding relevant biomedical datasets: the UC San Diego solution for the bioCADDIE Retrieval Challenge
title_sort finding relevant biomedical datasets: the uc san diego solution for the biocaddie retrieval challenge
topic Original Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5861401/
https://www.ncbi.nlm.nih.gov/pubmed/29688374
http://dx.doi.org/10.1093/database/bay017
work_keys_str_mv AT weiwei findingrelevantbiomedicaldatasetstheucsandiegosolutionforthebiocaddieretrievalchallenge
AT jizhanglong findingrelevantbiomedicaldatasetstheucsandiegosolutionforthebiocaddieretrievalchallenge
AT heyupeng findingrelevantbiomedicaldatasetstheucsandiegosolutionforthebiocaddieretrievalchallenge
AT zhangkai findingrelevantbiomedicaldatasetstheucsandiegosolutionforthebiocaddieretrievalchallenge
AT hayuanchi findingrelevantbiomedicaldatasetstheucsandiegosolutionforthebiocaddieretrievalchallenge
AT liqi findingrelevantbiomedicaldatasetstheucsandiegosolutionforthebiocaddieretrievalchallenge
AT ohnomachadolucila findingrelevantbiomedicaldatasetstheucsandiegosolutionforthebiocaddieretrievalchallenge