Addressing the need for interactive, efficient, and reproducible data processing in ecology with the datacleanr R package

Ecological research, just as all Earth System Sciences, is becoming increasingly data-rich. Tools for processing of “big data” are continuously developed to meet corresponding technical and logistical challenges. However, even at smaller scales, data sets may be challenging when best practices in da...

Bibliographic Details
Main authors: Hurley, Alexander G.; Peters, Richard L.; Pappas, Christoforos; Steger, David N.; Heinrich, Ingo
Format: Online Article Text
Language: English
Published: Public Library of Science 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9098071/
https://www.ncbi.nlm.nih.gov/pubmed/35551557
http://dx.doi.org/10.1371/journal.pone.0268426
_version_ 1784706301288251392
author Hurley, Alexander G.
Peters, Richard L.
Pappas, Christoforos
Steger, David N.
Heinrich, Ingo
author_facet Hurley, Alexander G.
Peters, Richard L.
Pappas, Christoforos
Steger, David N.
Heinrich, Ingo
author_sort Hurley, Alexander G.
collection PubMed
description Ecological research, just as all Earth System Sciences, is becoming increasingly data-rich. Tools for processing of “big data” are continuously developed to meet corresponding technical and logistical challenges. However, even at smaller scales, data sets may be challenging when best practices in data exploration, quality control and reproducibility are to be met. This can occur when conventional methods, such as generating and assessing diagnostic visualizations or tables, become unfeasible due to time and practicality constraints. Interactive processing can alleviate this issue, and is increasingly utilized to ensure that large data sets are diligently handled. However, recent interactive tools rarely enable data manipulation, may not generate reproducible outputs, or are typically data/domain-specific. We developed datacleanr, an interactive tool that facilitates best practices in data exploration, quality control (e.g., outlier assessment) and flexible processing for multiple tabular data types, including time series and georeferenced data. The package is open-source, and based on the R programming language. A key functionality of datacleanr is the “reproducible recipe”—a translation of all interactive actions into R code, which can be integrated into existing analyses pipelines. This enables researchers experienced with script-based workflows to utilize the strengths of interactive processing without sacrificing their usual work style or functionalities from other (R) packages. We demonstrate the package’s utility by addressing two common issues during data analyses, namely 1) identifying problematic structures and artefacts in hierarchically nested data, and 2) preventing excessive loss of data from ‘coarse,’ code-based filtering of time series. Ultimately, with datacleanr we aim to improve researchers’ workflows and increase confidence in and reproducibility of their results.
format Online
Article
Text
id pubmed-9098071
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-9098071 2022-05-13 Addressing the need for interactive, efficient, and reproducible data processing in ecology with the datacleanr R package Hurley, Alexander G. Peters, Richard L. Pappas, Christoforos Steger, David N. Heinrich, Ingo PLoS One Research Article Ecological research, just as all Earth System Sciences, is becoming increasingly data-rich. Tools for processing of “big data” are continuously developed to meet corresponding technical and logistical challenges. However, even at smaller scales, data sets may be challenging when best practices in data exploration, quality control and reproducibility are to be met. This can occur when conventional methods, such as generating and assessing diagnostic visualizations or tables, become unfeasible due to time and practicality constraints. Interactive processing can alleviate this issue, and is increasingly utilized to ensure that large data sets are diligently handled. However, recent interactive tools rarely enable data manipulation, may not generate reproducible outputs, or are typically data/domain-specific. We developed datacleanr, an interactive tool that facilitates best practices in data exploration, quality control (e.g., outlier assessment) and flexible processing for multiple tabular data types, including time series and georeferenced data. The package is open-source, and based on the R programming language. A key functionality of datacleanr is the “reproducible recipe”—a translation of all interactive actions into R code, which can be integrated into existing analyses pipelines. This enables researchers experienced with script-based workflows to utilize the strengths of interactive processing without sacrificing their usual work style or functionalities from other (R) packages. We demonstrate the package’s utility by addressing two common issues during data analyses, namely 1) identifying problematic structures and artefacts in hierarchically nested data, and 2) preventing excessive loss of data from ‘coarse,’ code-based filtering of time series. Ultimately, with datacleanr we aim to improve researchers’ workflows and increase confidence in and reproducibility of their results. Public Library of Science 2022-05-12 /pmc/articles/PMC9098071/ /pubmed/35551557 http://dx.doi.org/10.1371/journal.pone.0268426 Text en © 2022 Hurley et al. https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle Research Article
Hurley, Alexander G.
Peters, Richard L.
Pappas, Christoforos
Steger, David N.
Heinrich, Ingo
Addressing the need for interactive, efficient, and reproducible data processing in ecology with the datacleanr R package
title Addressing the need for interactive, efficient, and reproducible data processing in ecology with the datacleanr R package
title_full Addressing the need for interactive, efficient, and reproducible data processing in ecology with the datacleanr R package
title_fullStr Addressing the need for interactive, efficient, and reproducible data processing in ecology with the datacleanr R package
title_full_unstemmed Addressing the need for interactive, efficient, and reproducible data processing in ecology with the datacleanr R package
title_short Addressing the need for interactive, efficient, and reproducible data processing in ecology with the datacleanr R package
title_sort addressing the need for interactive, efficient, and reproducible data processing in ecology with the datacleanr r package
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9098071/
https://www.ncbi.nlm.nih.gov/pubmed/35551557
http://dx.doi.org/10.1371/journal.pone.0268426
work_keys_str_mv AT hurleyalexanderg addressingtheneedforinteractiveefficientandreproducibledataprocessinginecologywiththedatacleanrrpackage
AT petersrichardl addressingtheneedforinteractiveefficientandreproducibledataprocessinginecologywiththedatacleanrrpackage
AT pappaschristoforos addressingtheneedforinteractiveefficientandreproducibledataprocessinginecologywiththedatacleanrrpackage
AT stegerdavidn addressingtheneedforinteractiveefficientandreproducibledataprocessinginecologywiththedatacleanrrpackage
AT heinrichingo addressingtheneedforinteractiveefficientandreproducibledataprocessinginecologywiththedatacleanrrpackage