Trends in the production of scientific data analysis resources

BACKGROUND: As the amount of scientific data grows, peer-reviewed Scientific Data Analysis Resources (SDARs) such as published software programs, databases and web servers have had a strong impact on the productivity of scientific research. SDARs are typically accessed via an Internet URL, which...

Bibliographic Details
Main Authors: Hennessey, Jason, Georgescu, Constantin, Wren, Jonathan D
Format: Online Article Text
Language: English
Published: BioMed Central 2014
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4251054/
https://www.ncbi.nlm.nih.gov/pubmed/25350391
http://dx.doi.org/10.1186/1471-2105-15-S11-S7
_version_ 1782346995711606784
author Hennessey, Jason
Georgescu, Constantin
Wren, Jonathan D
author_facet Hennessey, Jason
Georgescu, Constantin
Wren, Jonathan D
author_sort Hennessey, Jason
collection PubMed
description BACKGROUND: As the amount of scientific data grows, peer-reviewed Scientific Data Analysis Resources (SDARs) such as published software programs, databases and web servers have had a strong impact on the productivity of scientific research. SDARs are typically accessed via an Internet URL, and URLs have been shown to decay in a time-dependent fashion. What is less clear is whether SDAR-producing group size or prior experience in SDAR production correlates with SDAR persistence, or whether certain institutions or regions account for a disproportionate number of peer-reviewed resources. METHODS: We first quantified the current availability of over 26,000 unique URLs published in MEDLINE abstracts/titles over the past 20 years, then extracted authorship, institutional and ZIP code data. We estimated which URLs were SDARs by using keyword proximity analysis. RESULTS: We identified 23,820 non-archival URLs produced between 1996 and 2013, of which 11,977 were classified as SDARs. Production of SDARs, as measured with the Gini coefficient, is more widely distributed among institutions (0.62) and ZIP codes (0.65) than scientific research in general, which tends to be disproportionately clustered within elite institutions (0.91) and ZIP codes (0.96). An estimated 1% of institutions produced 68% of published research, whereas the top 1% accounted for only 16% of SDARs. Some labs produced many SDARs (maximum detected = 64), but 74% of SDAR-producing authors have published only one SDAR. Interestingly, decayed SDARs have significantly fewer authors on average (4.33 ± 3.06) than available SDARs (4.88 ± 3.59) (p < 8.32 × 10^-4). Approximately 3.4% of URLs, as published, contain errors in their entry/format, including DOIs and links to clinical trials registry numbers. CONCLUSION: SDAR production is less dependent upon institutional location and resources than scientific research in general, and SDAR online persistence does not seem to be a function of infrastructure or expertise. Yet SDAR team size correlates positively with SDAR accessibility, suggesting that a sociological factor may be involved. While a detectable URL entry error rate of 3.4% is relatively low, it raises the question of whether this is a general error rate that extends to other published entities.
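The abstract summarizes how concentrated output is with a Gini coefficient and a "top 1%" share, but this record does not say how the authors computed either. The following is only a minimal sketch of one common formulation, assuming a list of per-institution (or per-ZIP-code) output counts. The function names `gini` and `top_share` and the toy counts are hypothetical and are not taken from the article.

```python
def gini(counts):
    """Gini coefficient of non-negative counts.

    0 means output is spread evenly; values near 1 mean a few
    producers account for almost everything.
    """
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        raise ValueError("need at least one non-zero count")
    # Rank-weighted form over ascending values x_1..x_n:
    #   G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum(rank * x for rank, x in enumerate(values, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def top_share(counts, fraction=0.01):
    """Share of total output produced by the top `fraction` of producers."""
    values = sorted(counts, reverse=True)
    total = sum(values)
    if total == 0:
        raise ValueError("need at least one non-zero count")
    # Always include at least one producer, even for tiny samples.
    k = max(1, int(len(values) * fraction))
    return sum(values[:k]) / total


if __name__ == "__main__":
    # Invented toy data: papers per institution (not from the study).
    even = [5, 5, 5, 5]
    skewed = [0, 0, 0, 10]
    print(gini(even), gini(skewed))
    print(top_share(skewed, fraction=0.25))
```

Under this formulation, a lower coefficient such as the 0.62 reported for SDARs across institutions indicates a flatter distribution than the 0.91 reported for research in general. A different estimator or binning choice would shift the exact values.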
format Online
Article
Text
id pubmed-4251054
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-4251054 2014-12-04 Trends in the production of scientific data analysis resources Hennessey, Jason Georgescu, Constantin Wren, Jonathan D BMC Bioinformatics Proceedings BioMed Central 2014-10-21 /pmc/articles/PMC4251054/ /pubmed/25350391 http://dx.doi.org/10.1186/1471-2105-15-S11-S7 Text en Copyright © 2014 Hennessey et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Proceedings
Hennessey, Jason
Georgescu, Constantin
Wren, Jonathan D
Trends in the production of scientific data analysis resources
title Trends in the production of scientific data analysis resources
title_full Trends in the production of scientific data analysis resources
title_fullStr Trends in the production of scientific data analysis resources
title_full_unstemmed Trends in the production of scientific data analysis resources
title_short Trends in the production of scientific data analysis resources
title_sort trends in the production of scientific data analysis resources
topic Proceedings
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4251054/
https://www.ncbi.nlm.nih.gov/pubmed/25350391
http://dx.doi.org/10.1186/1471-2105-15-S11-S7