Search Datasets in Literature: A Case Study of GWAS

Bibliographic Details
Main Authors: Dong, Xiao; Zhang, Yaoyun; Xu, Hua
Format: Online Article Text
Language: English
Published: American Medical Informatics Association 2017
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5543360/
https://www.ncbi.nlm.nih.gov/pubmed/28815103
Description
Summary: One of the missions of the NIH BD2K (Big Data to Knowledge) initiative is to make data discoverable and promote the re-use of existing datasets. Our ultimate goal is to develop a scalable approach that can automatically scan millions of scientific publications and identify their underlying datasets. Using Genome-Wide Association Studies (GWAS) as a use case, we conducted an initial study to identify GWAS dataset attributes in MEDLINE abstracts by developing a hybrid approach that combines domain dictionaries and pattern-based rules. The automatic GWAS dataset attribute recognition system achieved an F-measure of 84.85%. We further applied the recognition system to index MEDLINE abstracts and built an online GWAS dataset search engine called “GWAS Dataset Finder”. Our evaluation showed that the GWAS Dataset Finder significantly outperformed PubMed in retrieving literature with the desired datasets. Our study demonstrates the potential of text mining methods for building a data discovery index: such methods can create a better index of the literature linked to its underlying datasets, thus improving data discoverability.
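
The hybrid approach named in the summary (domain dictionaries plus pattern-based rules) can be illustrated with a minimal Python sketch. This is not the authors' system. The dictionaries, the attribute labels (`platform`, `ethnicity`, `sample_size`), the regular expression, and the function names are all hypothetical and chosen only to show the general idea. The `f_measure` helper uses the standard balanced F-measure, the harmonic mean of precision and recall; the record does not say how the reported score was computed.

```python
import re

# Hypothetical mini-dictionaries; the actual domain dictionaries are not given in this record.
PLATFORMS = {"affymetrix", "illumina"}
ETHNICITIES = {"european", "han chinese", "japanese", "african american"}

# Hypothetical pattern rule: a sample-size mention such as "1,234 cases" or "5000 controls".
SAMPLE_SIZE = re.compile(
    r"\b(\d{1,3}(?:,\d{3})+|\d+)\s+(cases|controls|individuals|subjects|participants)\b",
    re.IGNORECASE,
)


def find_terms(text, terms, label):
    """Dictionary lookup: return (label, matched text) for each whole-word term hit."""
    found = []
    for term in sorted(terms):
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            found.append((label, match.group(0)))
    return found


def extract_attributes(text):
    """Combine dictionary hits with rule-based hits into one attribute list."""
    attributes = find_terms(text, PLATFORMS, "platform")
    attributes += find_terms(text, ETHNICITIES, "ethnicity")
    for match in SAMPLE_SIZE.finditer(text):
        # Normalize "1,234" to "1234" so the value can be indexed as a number.
        attributes.append(("sample_size", match.group(1).replace(",", "")))
    return attributes


def f_measure(true_pos, false_pos, false_neg):
    """Balanced F-measure: harmonic mean of precision and recall."""
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    sentence = ("We genotyped 1,234 cases and 5000 controls of European "
                "ancestry on Illumina arrays.")
    for label, value in extract_attributes(sentence):
        print(label, value)
```

Attributes extracted this way could then be stored alongside each abstract's identifier to support attribute-based search, which is the role the summary assigns to the GWAS Dataset Finder index.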