Search Datasets in Literature: A Case Study of GWAS

One of the missions of the NIH BD2K (Big Data to Knowledge) initiative is to make data discoverable and promote the re-use of existing datasets. Our ultimate goal is to develop a scalable approach that can automatically scan millions of scientific publications and identify underlying data sets. Usin...

Bibliographic Details
Main Authors: Dong, Xiao, Zhang, Yaoyun, Xu, Hua
Format: Online Article Text
Language: English
Published: American Medical Informatics Association 2017
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5543360/
https://www.ncbi.nlm.nih.gov/pubmed/28815103
_version_ 1783255135359598592
author Dong, Xiao
Zhang, Yaoyun
Xu, Hua
author_facet Dong, Xiao
Zhang, Yaoyun
Xu, Hua
author_sort Dong, Xiao
collection PubMed
description One of the missions of the NIH BD2K (Big Data to Knowledge) initiative is to make data discoverable and promote the re-use of existing datasets. Our ultimate goal is to develop a scalable approach that can automatically scan millions of scientific publications and identify underlying data sets. Using Genome-Wide Association Studies (GWAS) as a use case, we conducted an initial study to identify GWAS dataset attributes in MEDLINE abstracts, by developing a hybrid approach that combines domain dictionaries and pattern-based rules. The automatic GWAS dataset attribute recognition system achieved an F-measure of 84.85%. We further applied the GWAS attribute recognition system to indexing MEDLINE abstracts and built an online GWAS dataset search engine called “GWAS Dataset Finder”. Our evaluation showed that the GWAS Dataset Finder outperformed PubMed significantly in retrieving literature with desired datasets. Our study demonstrates the potential application of text mining methods in building the data discovery index. It can create a better index of literature linked with their underlying data sets, thus improving data discoverability.
format Online
Article
Text
id pubmed-5543360
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher American Medical Informatics Association
record_format MEDLINE/PubMed
spelling pubmed-55433602017-08-16 Search Datasets in Literature: A Case Study of GWAS Dong, Xiao Zhang, Yaoyun Xu, Hua AMIA Jt Summits Transl Sci Proc Articles One of the missions of the NIH BD2K (Big Data to Knowledge) initiative is to make data discoverable and promote the re-use of existing datasets. Our ultimate goal is to develop a scalable approach that can automatically scan millions of scientific publications and identify underlying data sets. Using Genome-Wide Association Studies (GWAS) as a use case, we conducted an initial study to identify GWAS dataset attributes in MEDLINE abstracts, by developing a hybrid approach that combines domain dictionaries and pattern-based rules. The automatic GWAS dataset attribute recognition system achieved an F-measure of 84.85%. We further applied the GWAS attribute recognition system to indexing MEDLINE abstracts and built an online GWAS dataset search engine called “GWAS Dataset Finder”. Our evaluation showed that the GWAS Dataset Finder outperformed PubMed significantly in retrieving literature with desired datasets. Our study demonstrates the potential application of text mining methods in building the data discovery index. It can create a better index of literature linked with their underlying data sets, thus improving data discoverability. American Medical Informatics Association 2017-07-26 /pmc/articles/PMC5543360/ /pubmed/28815103 Text en ©2017 AMIA - All rights reserved. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose
spellingShingle Articles
Dong, Xiao
Zhang, Yaoyun
Xu, Hua
Search Datasets in Literature: A Case Study of GWAS
title Search Datasets in Literature: A Case Study of GWAS
title_full Search Datasets in Literature: A Case Study of GWAS
title_fullStr Search Datasets in Literature: A Case Study of GWAS
title_full_unstemmed Search Datasets in Literature: A Case Study of GWAS
title_short Search Datasets in Literature: A Case Study of GWAS
title_sort search datasets in literature: a case study of gwas
topic Articles
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5543360/
https://www.ncbi.nlm.nih.gov/pubmed/28815103