Improving average ranking precision in user searches for biomedical research datasets
Availability of research datasets is a keystone of health and life science study reproducibility and scientific progress. Due to the heterogeneity and complexity of these data, a key challenge for research data management systems is to provide users with the best answers for their search queries.
Main Authors: | Teodoro, Douglas; Mottin, Luc; Gobeill, Julien; Gaudinat, Arnaud; Vachon, Thérèse; Ruch, Patrick |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2017 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5714153/ https://www.ncbi.nlm.nih.gov/pubmed/29220475 http://dx.doi.org/10.1093/database/bax083 |
_version_ | 1783283533749420032 |
---|---|
author | Teodoro, Douglas Mottin, Luc Gobeill, Julien Gaudinat, Arnaud Vachon, Thérèse Ruch, Patrick |
author_facet | Teodoro, Douglas Mottin, Luc Gobeill, Julien Gaudinat, Arnaud Vachon, Thérèse Ruch, Patrick |
author_sort | Teodoro, Douglas |
collection | PubMed |
description | Availability of research datasets is a keystone of health and life science study reproducibility and scientific progress. Due to the heterogeneity and complexity of these data, a key challenge for research data management systems is to provide users with the best answers for their search queries. In the context of the 2016 bioCADDIE Dataset Retrieval Challenge, we investigate a novel ranking pipeline to improve the search for datasets used in biomedical experiments. Our system comprises a query expansion model based on word embeddings, a similarity measure algorithm that takes into consideration the relevance of the query terms, and a dataset categorization method that boosts the rank of datasets matching query constraints. The system was evaluated using a corpus with 800k datasets and 21 annotated user queries, and provided competitive results when compared to the other challenge participants. In the official run, it achieved the highest infAP, +22.3% higher than the median infAP of the participants’ best submissions. Overall, it ranks in the top 2 when an aggregated metric using the best official measures per participant is considered. The query expansion method had a positive impact on the system’s performance, improving our baseline by up to +5.0% and +3.4% for the infAP and infNDCG metrics, respectively. The similarity measure algorithm showed robust performance under different training conditions, with small performance variations compared to the Divergence from Randomness framework. Finally, the result categorization did not have a significant impact on the system’s performance. We believe that our solution could be used to enhance biomedical dataset management systems. The use of data-driven expansion methods, such as those based on word embeddings, could be an alternative to complex biomedical terminologies. Nevertheless, due to the limited size of the assessment set, further experiments need to be performed to draw conclusive results. Database URL: https://biocaddie.org/benchmark-data |
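The core idea of the query expansion step described above — enriching a user query with terms that are close in an embedding space — can be sketched as follows. This is a minimal illustration only: the embedding table, vocabulary, threshold, and neighbour count below are all hypothetical stand-ins, not the vectors or parameters used in the paper.

```python
import math

# Toy embedding table standing in for word vectors trained on a
# biomedical corpus; vectors and vocabulary are purely illustrative.
EMBEDDINGS = {
    "protein":   [0.9, 0.1, 0.0],
    "proteomic": [0.8, 0.2, 0.1],
    "gene":      [0.1, 0.9, 0.0],
    "genomic":   [0.2, 0.8, 0.1],
    "dataset":   [0.0, 0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def expand_query(terms, k=1, threshold=0.7):
    """Append up to k nearest-neighbour terms per query term,
    keeping only neighbours above a similarity threshold."""
    expanded = list(terms)
    for term in terms:
        vec = EMBEDDINGS.get(term)
        if vec is None:  # out-of-vocabulary query term: skip
            continue
        neighbours = sorted(
            ((cosine(vec, v), w) for w, v in EMBEDDINGS.items()
             if w != term and w not in expanded),
            reverse=True,
        )
        expanded.extend(w for s, w in neighbours[:k] if s >= threshold)
    return expanded

print(expand_query(["protein", "gene"]))
# → ['protein', 'gene', 'proteomic', 'genomic']
```

In a real system the expanded terms would typically be added to the query with a lower weight than the original terms, so that expansion broadens recall without drowning out the user's own wording.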
format | Online Article Text |
id | pubmed-5714153 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2017 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-5714153 2017-12-08 Improving average ranking precision in user searches for biomedical research datasets Teodoro, Douglas Mottin, Luc Gobeill, Julien Gaudinat, Arnaud Vachon, Thérèse Ruch, Patrick Database (Oxford) Original Article Availability of research datasets is a keystone of health and life science study reproducibility and scientific progress. Due to the heterogeneity and complexity of these data, a key challenge for research data management systems is to provide users with the best answers for their search queries. In the context of the 2016 bioCADDIE Dataset Retrieval Challenge, we investigate a novel ranking pipeline to improve the search for datasets used in biomedical experiments. Our system comprises a query expansion model based on word embeddings, a similarity measure algorithm that takes into consideration the relevance of the query terms, and a dataset categorization method that boosts the rank of datasets matching query constraints. The system was evaluated using a corpus with 800k datasets and 21 annotated user queries, and provided competitive results when compared to the other challenge participants. In the official run, it achieved the highest infAP, +22.3% higher than the median infAP of the participants’ best submissions. Overall, it ranks in the top 2 when an aggregated metric using the best official measures per participant is considered. The query expansion method had a positive impact on the system’s performance, improving our baseline by up to +5.0% and +3.4% for the infAP and infNDCG metrics, respectively. The similarity measure algorithm showed robust performance under different training conditions, with small performance variations compared to the Divergence from Randomness framework. Finally, the result categorization did not have a significant impact on the system’s performance. We believe that our solution could be used to enhance biomedical dataset management systems. The use of data-driven expansion methods, such as those based on word embeddings, could be an alternative to complex biomedical terminologies. Nevertheless, due to the limited size of the assessment set, further experiments need to be performed to draw conclusive results. Database URL: https://biocaddie.org/benchmark-data Oxford University Press 2017-11-06 /pmc/articles/PMC5714153/ /pubmed/29220475 http://dx.doi.org/10.1093/database/bax083 Text en © The Author(s) 2017. Published by Oxford University Press. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Original Article Teodoro, Douglas Mottin, Luc Gobeill, Julien Gaudinat, Arnaud Vachon, Thérèse Ruch, Patrick Improving average ranking precision in user searches for biomedical research datasets |
title | Improving average ranking precision in user searches for biomedical research datasets |
title_full | Improving average ranking precision in user searches for biomedical research datasets |
title_fullStr | Improving average ranking precision in user searches for biomedical research datasets |
title_full_unstemmed | Improving average ranking precision in user searches for biomedical research datasets |
title_short | Improving average ranking precision in user searches for biomedical research datasets |
title_sort | improving average ranking precision in user searches for biomedical research datasets |
topic | Original Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5714153/ https://www.ncbi.nlm.nih.gov/pubmed/29220475 http://dx.doi.org/10.1093/database/bax083 |
work_keys_str_mv | AT teodorodouglas improvingaveragerankingprecisioninusersearchesforbiomedicalresearchdatasets AT mottinluc improvingaveragerankingprecisioninusersearchesforbiomedicalresearchdatasets AT gobeilljulien improvingaveragerankingprecisioninusersearchesforbiomedicalresearchdatasets AT gaudinatarnaud improvingaveragerankingprecisioninusersearchesforbiomedicalresearchdatasets AT vachontherese improvingaveragerankingprecisioninusersearchesforbiomedicalresearchdatasets AT ruchpatrick improvingaveragerankingprecisioninusersearchesforbiomedicalresearchdatasets |