
A content-based dataset recommendation system for researchers—a case study on Gene Expression Omnibus (GEO) repository

Bibliographic Details
Main Authors: Patra, Braja Gopal; Roberts, Kirk; Wu, Hulin
Format: Online Article Text
Language: English
Published: Oxford University Press 2020
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7659921/
https://www.ncbi.nlm.nih.gov/pubmed/33002137
http://dx.doi.org/10.1093/database/baaa064
author Patra, Braja Gopal
Roberts, Kirk
Wu, Hulin
collection PubMed
description It is a growing trend among researchers to make their data publicly available for experimental reproducibility and data reusability. Sharing data with fellow researchers helps in increasing the visibility of the work. On the other hand, there are researchers who are inhibited by the lack of data resources. To overcome this challenge, many repositories and knowledge bases have been established to date to ease data sharing. Further, in the past two decades, there has been an exponential increase in the number of datasets added to these dataset repositories. However, most of these repositories are domain-specific, and none of them can recommend datasets to researchers/users. Naturally, it is challenging for a researcher to keep track of all the relevant repositories for potential use. Thus, a dataset recommender system that recommends datasets to a researcher based on previous publications can enhance their productivity and expedite further research. This work adopts an information retrieval (IR) paradigm for dataset recommendation. We hypothesize that two fundamental differences exist between dataset recommendation and PubMed-style biomedical IR beyond the corpus. First, instead of keywords, the query is the researcher, embodied by his or her publications. Second, to filter the relevant datasets from non-relevant ones, researchers are better represented by a set of interests, as opposed to the entire body of their research. This second approach is implemented using a non-parametric clustering technique. These clusters are used to recommend datasets for each researcher using the cosine similarity between the vector representations of publication clusters and datasets. The maximum normalized discounted cumulative gain at 10 (NDCG@10), precision at 10 (p@10) partial and p@10 strict of 0.89, 0.78 and 0.61, respectively, were obtained using the proposed method after manual evaluation by five researchers. 
To the best of our knowledge, this is the first study of its kind on content-based dataset recommendation. We hope that this system will further promote data sharing, reduce researchers’ workload in identifying the right dataset and increase the reusability of biomedical datasets. Database URL: http://genestudy.org/recommends/#/
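The ranking and evaluation steps named in the description can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the publication clusters and the datasets have already been turned into fixed-length vectors, scores each dataset by its highest cosine similarity to any cluster (the choice of the maximum is an assumption), and computes NDCG@k with a linear gain and log2 discount, which is one common variant. The function names and the dataset identifiers are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def recommend(cluster_vectors, dataset_vectors, k=10):
    """Return the k dataset ids most similar to any publication cluster.

    Assumption for this sketch: a dataset's score is its best (maximum)
    cosine similarity over the researcher's non-empty list of clusters.
    """
    scores = {
        ds_id: max(cosine(cluster, vec) for cluster in cluster_vectors)
        for ds_id, vec in dataset_vectors.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


def ndcg_at_k(relevances, k=10):
    """NDCG@k for graded relevance judgments listed in ranked order."""
    def dcg(rels):
        # Position i (0-based) is discounted by log2(i + 2).
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels))

    ideal = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal if ideal > 0 else 0.0


# Toy vectors: two interest clusters and three hypothetical datasets.
clusters = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
datasets = {
    "dataset_a": [0.9, 0.1, 0.0],
    "dataset_b": [0.0, 0.0, 1.0],
    "dataset_c": [0.1, 0.8, 0.1],
}
top = recommend(clusters, datasets, k=2)
```

Representing a researcher by several cluster vectors rather than one averaged vector means a dataset only has to match one of the researcher's interests to rank highly, which is the intuition behind the second hypothesis in the description.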
format Online
Article
Text
id pubmed-7659921
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-7659921 2020-11-18 Database (Oxford) Original Article Oxford University Press 2020-11-12 /pmc/articles/PMC7659921/ /pubmed/33002137 http://dx.doi.org/10.1093/database/baaa064 Text en © The Author(s) 2020. Published by Oxford University Press. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title A content-based dataset recommendation system for researchers—a case study on Gene Expression Omnibus (GEO) repository
topic Original Article