
Wide-Open: Accelerating public data release by automating detection of overdue datasets

Open data is a vital pillar of open science and a key enabler for reproducibility, data reuse, and novel discoveries. Enforcement of open-data policies, however, largely relies on manual efforts, which invariably lag behind the increasingly automated generation of biological data. To address this pr...


Bibliographic Details
Main authors: Grechkin, Maxim, Poon, Hoifung, Howe, Bill
Format: Online Article Text
Language: English
Published: Public Library of Science 2017
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5464523/
https://www.ncbi.nlm.nih.gov/pubmed/28594819
http://dx.doi.org/10.1371/journal.pbio.2002477
author Grechkin, Maxim
Poon, Hoifung
Howe, Bill
collection PubMed
description Open data is a vital pillar of open science and a key enabler for reproducibility, data reuse, and novel discoveries. Enforcement of open-data policies, however, largely relies on manual efforts, which invariably lag behind the increasingly automated generation of biological data. To address this problem, we developed a general approach to automatically identify datasets overdue for public release by applying text mining to identify dataset references in published articles and parse query results from repositories to determine if the datasets remain private. We demonstrate the effectiveness of this approach on 2 popular National Center for Biotechnology Information (NCBI) repositories: Gene Expression Omnibus (GEO) and Sequence Read Archive (SRA). Our Wide-Open system identified a large number of overdue datasets, which spurred administrators to respond directly by releasing 400 datasets in one week.
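The two steps named in the description (finding dataset references in article text, then checking whether each dataset is still private) can be sketched roughly as follows. This is a minimal illustration, not the authors' Wide-Open implementation: the accession patterns (GEO series as `GSE` plus digits, SRA-style accessions as `SRP`/`SRX`/`SRR`/`SRS` plus digits, with `E`/`D` prefixes for mirrored archives) are assumptions about common formats, and `find_accessions`, `overdue`, and the `is_public` callback are hypothetical names introduced here. The repository query itself is left to the caller.

```python
import re

# Assumed accession formats (not taken from the record above):
# GEO series look like "GSE" + digits; SRA-style accessions look like
# S/E/D + "R" + one of P/X/R/S + digits.
GEO_PATTERN = re.compile(r"\bGSE\d+\b")
SRA_PATTERN = re.compile(r"\b[SED]R[PXRS]\d+\b")


def find_accessions(text):
    """Return the sorted, de-duplicated accessions mentioned in `text`."""
    return {
        "geo": sorted(set(GEO_PATTERN.findall(text))),
        "sra": sorted(set(SRA_PATTERN.findall(text))),
    }


def overdue(cited, is_public):
    """Return cited accessions that the caller-supplied lookup reports as not public.

    `is_public` is a hypothetical callback standing in for a repository query.
    """
    return [acc for acc in cited if not is_public(acc)]


# Usage with a stub lookup in place of a real repository query.
sample = "Data are in GEO under GSE12345 and in SRA under SRP067890 (see also GSE12345)."
found = find_accessions(sample)
cited = found["geo"] + found["sra"]
still_private = overdue(cited, is_public=lambda acc: acc.startswith("GSE"))
```

Passing the lookup in as a callback keeps the text-mining step testable without network access; a real check would have to parse each repository's actual query response.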
format Online
Article
Text
id pubmed-5464523
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-5464523 2017-06-22 Wide-Open: Accelerating public data release by automating detection of overdue datasets. Grechkin, Maxim; Poon, Hoifung; Howe, Bill. PLoS Biol. Community Page. Public Library of Science 2017-06-08. /pmc/articles/PMC5464523/ /pubmed/28594819 http://dx.doi.org/10.1371/journal.pbio.2002477 Text en © 2017 Grechkin et al. http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title Wide-Open: Accelerating public data release by automating detection of overdue datasets
topic Community Page