The role of software in science: a knowledge graph-based analysis of software mentions in PubMed Central

Bibliographic Details
Main Authors: Schindler, David; Bensmann, Felix; Dietze, Stefan; Krüger, Frank
Format: Online Article Text
Language: English
Published: PeerJ Inc. 2022
Subjects: Data Mining and Machine Learning
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8771769/
https://www.ncbi.nlm.nih.gov/pubmed/35111920
http://dx.doi.org/10.7717/peerj-cs.835
author Schindler, David
Bensmann, Felix
Dietze, Stefan
Krüger, Frank
collection PubMed
description Science across all disciplines has become increasingly data-driven, creating additional needs for software to collect, process and analyse data. Transparency about the software used in the scientific process is therefore crucial for understanding the provenance of individual research data and insights, is a prerequisite for reproducibility, and can enable macro-analysis of the evolution of scientific methods over time. However, a lack of rigor in software citation practices makes the automated detection and disambiguation of software mentions a challenging problem. In this work, we provide a large-scale analysis of software usage and citation practices, facilitated by an unprecedented knowledge graph of software mentions and affiliated metadata. The graph was generated by supervised information extraction models trained on a unique gold standard corpus and applied to more than 3 million scientific articles. Our information extraction approach distinguishes different types of software and mentions, disambiguates mentions, and significantly outperforms the state of the art, leading to the most comprehensive corpus of 11.8 M software mentions, described through a knowledge graph consisting of more than 300 M triples. Our analysis provides insights into the evolution of software usage and citation patterns across various fields, ranks of journals, and impact of publications. To the best of our knowledge, this is the most comprehensive analysis of software use and citation to date; all data and models are shared publicly to facilitate further research into the scientific use and citation of software.
format Online
Article
Text
id pubmed-8771769
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher PeerJ Inc.
record_format MEDLINE/PubMed
spelling pubmed-8771769 2022-02-01 The role of software in science: a knowledge graph-based analysis of software mentions in PubMed Central Schindler, David Bensmann, Felix Dietze, Stefan Krüger, Frank PeerJ Comput Sci Data Mining and Machine Learning PeerJ Inc. 2022-01-14 /pmc/articles/PMC8771769/ /pubmed/35111920 http://dx.doi.org/10.7717/peerj-cs.835 Text en © 2022 Schindler et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, reproduction and adaptation in any medium and for any purpose provided that it is properly attributed. For attribution, the original author(s), title, publication source (PeerJ Computer Science) and either DOI or URL of the article must be cited.
title The role of software in science: a knowledge graph-based analysis of software mentions in PubMed Central
topic Data Mining and Machine Learning
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8771769/
https://www.ncbi.nlm.nih.gov/pubmed/35111920
http://dx.doi.org/10.7717/peerj-cs.835