Search Datasets in Literature: A Case Study of GWAS
One of the missions of the NIH BD2K (Big Data to Knowledge) initiative is to make data discoverable and promote the re-use of existing datasets. Our ultimate goal is to develop a scalable approach that can automatically scan millions of scientific publications and identify underlying data sets. Usin...
Main authors: | Dong, Xiao; Zhang, Yaoyun; Xu, Hua |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | American Medical Informatics Association, 2017 |
Subjects: | Articles |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5543360/ https://www.ncbi.nlm.nih.gov/pubmed/28815103 |
_version_ | 1783255135359598592 |
---|---|
author | Dong, Xiao Zhang, Yaoyun Xu, Hua |
author_facet | Dong, Xiao Zhang, Yaoyun Xu, Hua |
author_sort | Dong, Xiao |
collection | PubMed |
description | One of the missions of the NIH BD2K (Big Data to Knowledge) initiative is to make data discoverable and promote the re-use of existing datasets. Our ultimate goal is to develop a scalable approach that can automatically scan millions of scientific publications and identify underlying data sets. Using Genome-Wide Association Studies (GWAS) as a use case, we conducted an initial study to identify GWAS dataset attributes in MEDLINE abstracts, by developing a hybrid approach that combines domain dictionaries and pattern-based rules. The automatic GWAS dataset attribute recognition system achieved an F-measure of 84.85%. We further applied the GWAS attribute recognition system to indexing MEDLINE abstracts and built an online GWAS dataset search engine called “GWAS Dataset Finder”. Our evaluation showed that the GWAS Dataset Finder outperformed PubMed significantly in retrieving literature with desired datasets. Our study demonstrates the potential application of text mining methods in building the data discovery index. It can create a better index of literature linked with their underlying data sets, thus improving data discoverability. |
format | Online Article Text |
id | pubmed-5543360 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2017 |
publisher | American Medical Informatics Association |
record_format | MEDLINE/PubMed |
spelling | pubmed-5543360 2017-08-16 Search Datasets in Literature: A Case Study of GWAS Dong, Xiao Zhang, Yaoyun Xu, Hua AMIA Jt Summits Transl Sci Proc Articles American Medical Informatics Association 2017-07-26 /pmc/articles/PMC5543360/ /pubmed/28815103 Text en ©2017 AMIA - All rights reserved. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose |
spellingShingle | Articles Dong, Xiao Zhang, Yaoyun Xu, Hua Search Datasets in Literature: A Case Study of GWAS |
title | Search Datasets in Literature: A Case Study of GWAS |
title_sort | search datasets in literature: a case study of gwas |
topic | Articles |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5543360/ https://www.ncbi.nlm.nih.gov/pubmed/28815103 |
work_keys_str_mv | AT dongxiao searchdatasetsinliteratureacasestudyofgwas AT zhangyaoyun searchdatasetsinliteratureacasestudyofgwas AT xuhua searchdatasetsinliteratureacasestudyofgwas |