Trends in the production of scientific data analysis resources
BACKGROUND: As the amount of scientific data grows, peer-reviewed Scientific Data Analysis Resources (SDARs) such as published software programs, databases and web servers have had a strong impact on the productivity of scientific research. SDARs are typically linked to using an Internet URL, which...
Main Authors: | Hennessey, Jason, Georgescu, Constantin, Wren, Jonathan D |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | BioMed Central 2014 |
Subjects: | Proceedings |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4251054/ https://www.ncbi.nlm.nih.gov/pubmed/25350391 http://dx.doi.org/10.1186/1471-2105-15-S11-S7 |
_version_ | 1782346995711606784 |
---|---|
author | Hennessey, Jason Georgescu, Constantin Wren, Jonathan D |
author_facet | Hennessey, Jason Georgescu, Constantin Wren, Jonathan D |
author_sort | Hennessey, Jason |
collection | PubMed |
description | BACKGROUND: As the amount of scientific data grows, peer-reviewed Scientific Data Analysis Resources (SDARs) such as published software programs, databases and web servers have had a strong impact on the productivity of scientific research. SDARs are typically linked to using an Internet URL, which have been shown to decay in a time-dependent fashion. What is less clear is whether or not SDAR-producing group size or prior experience in SDAR production correlates with SDAR persistence or whether certain institutions or regions account for a disproportionate number of peer-reviewed resources. METHODS: We first quantified the current availability of over 26,000 unique URLs published in MEDLINE abstracts/titles over the past 20 years, then extracted authorship, institutional and ZIP code data. We estimated which URLs were SDARs by using keyword proximity analysis. RESULTS: We identified 23,820 non-archival URLs produced between 1996 and 2013, out of which 11,977 were classified as SDARs. Production of SDARs as measured with the Gini coefficient is more widely distributed among institutions (.62) and ZIP codes (.65) than scientific research in general, which tends to be disproportionately clustered within elite institutions (.91) and ZIPs (.96). An estimated one percent of institutions produced 68% of published research whereas the top 1% only accounted for 16% of SDARs. Some labs produced many SDARs (maximum detected = 64), but 74% of SDAR-producing authors have only published one SDAR. Interestingly, decayed SDARs have significantly fewer average authors (4.33 ± 3.06) than available SDARs (4.88 ± 3.59) (p < 8.32 × 10^-4). Approximately 3.4% of URLs, as published, contain errors in their entry/format, including DOIs and links to clinical trials registry numbers. CONCLUSION: SDAR production is less dependent upon institutional location and resources, and SDAR online persistence does not seem to be a function of infrastructure or expertise. Yet, SDAR team size correlates positively with SDAR accessibility, suggesting a possible sociological factor involved. While a detectable URL entry error rate of 3.4% is relatively low, it raises the question of whether or not this is a general error rate that extends to additional published entities. |
format | Online Article Text |
id | pubmed-4251054 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2014 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-4251054 2014-12-04 Trends in the production of scientific data analysis resources Hennessey, Jason Georgescu, Constantin Wren, Jonathan D BMC Bioinformatics Proceedings BACKGROUND: As the amount of scientific data grows, peer-reviewed Scientific Data Analysis Resources (SDARs) such as published software programs, databases and web servers have had a strong impact on the productivity of scientific research. SDARs are typically linked to using an Internet URL, which have been shown to decay in a time-dependent fashion. What is less clear is whether or not SDAR-producing group size or prior experience in SDAR production correlates with SDAR persistence or whether certain institutions or regions account for a disproportionate number of peer-reviewed resources. METHODS: We first quantified the current availability of over 26,000 unique URLs published in MEDLINE abstracts/titles over the past 20 years, then extracted authorship, institutional and ZIP code data. We estimated which URLs were SDARs by using keyword proximity analysis. RESULTS: We identified 23,820 non-archival URLs produced between 1996 and 2013, out of which 11,977 were classified as SDARs. Production of SDARs as measured with the Gini coefficient is more widely distributed among institutions (.62) and ZIP codes (.65) than scientific research in general, which tends to be disproportionately clustered within elite institutions (.91) and ZIPs (.96). An estimated one percent of institutions produced 68% of published research whereas the top 1% only accounted for 16% of SDARs. Some labs produced many SDARs (maximum detected = 64), but 74% of SDAR-producing authors have only published one SDAR. Interestingly, decayed SDARs have significantly fewer average authors (4.33 ± 3.06) than available SDARs (4.88 ± 3.59) (p < 8.32 × 10^-4). Approximately 3.4% of URLs, as published, contain errors in their entry/format, including DOIs and links to clinical trials registry numbers. CONCLUSION: SDAR production is less dependent upon institutional location and resources, and SDAR online persistence does not seem to be a function of infrastructure or expertise. Yet, SDAR team size correlates positively with SDAR accessibility, suggesting a possible sociological factor involved. While a detectable URL entry error rate of 3.4% is relatively low, it raises the question of whether or not this is a general error rate that extends to additional published entities. BioMed Central 2014-10-21 /pmc/articles/PMC4251054/ /pubmed/25350391 http://dx.doi.org/10.1186/1471-2105-15-S11-S7 Text en Copyright © 2014 Hennessey et al.; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/4.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Proceedings Hennessey, Jason Georgescu, Constantin Wren, Jonathan D Trends in the production of scientific data analysis resources |
title | Trends in the production of scientific data analysis resources |
title_full | Trends in the production of scientific data analysis resources |
title_fullStr | Trends in the production of scientific data analysis resources |
title_full_unstemmed | Trends in the production of scientific data analysis resources |
title_short | Trends in the production of scientific data analysis resources |
title_sort | trends in the production of scientific data analysis resources |
topic | Proceedings |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4251054/ https://www.ncbi.nlm.nih.gov/pubmed/25350391 http://dx.doi.org/10.1186/1471-2105-15-S11-S7 |
work_keys_str_mv | AT hennesseyjason trendsintheproductionofscientificdataanalysisresources AT georgescuconstantin trendsintheproductionofscientificdataanalysisresources AT wrenjonathand trendsintheproductionofscientificdataanalysisresources |
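
Note on the concentration measures in the abstract: the record reports Gini coefficients (for example .62 for SDARs versus .91 for research in general, by institution) and top-1% shares, but does not say how they were computed. The following is a minimal sketch of one standard way to compute both from per-institution output counts. It is not the authors' code; the function names `gini` and `top_share` and the sample counts are hypothetical and chosen only for illustration.

```python
from typing import Sequence


def gini(counts: Sequence[float]) -> float:
    """Gini coefficient of non-negative counts.

    0 means output is spread perfectly evenly across producers;
    values approaching 1 mean a few producers account for almost everything.
    """
    values = sorted(counts)
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        raise ValueError("need at least one positive count")
    # Rank-weighted formula on ascending-sorted data:
    #   G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n,  i = 1..n
    weighted = sum(rank * x for rank, x in enumerate(values, start=1))
    return 2.0 * weighted / (n * total) - (n + 1) / n


def top_share(counts: Sequence[float], fraction: float = 0.01) -> float:
    """Share of total output held by the top `fraction` of producers.

    At least one producer is always included, so very small samples
    still return a meaningful value.
    """
    values = sorted(counts, reverse=True)
    k = max(1, int(len(values) * fraction))
    return sum(values[:k]) / sum(values)


if __name__ == "__main__":
    # Hypothetical per-institution publication counts (not data from the article).
    even = [10, 10, 10, 10]
    skewed = [0, 0, 0, 10]
    print("even:", gini(even), "skewed:", gini(skewed))
    print("top 25% share (skewed):", top_share(skewed, fraction=0.25))
```

Under this definition, a lower Gini coefficient for SDARs than for research output overall corresponds to the abstract's statement that SDAR production is more widely distributed among institutions and ZIP codes.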