DataCurator.jl: efficient, portable and reproducible validation, curation and transformation of large heterogeneous datasets using human-readable recipes compiled into machine-verifiable templates

Large-scale processing of heterogeneous datasets in interdisciplinary research often requires time-consuming manual data curation. Ambiguity in the data layout and preprocessing conventions can easily compromise reproducibility and scientific discovery, and even when detected, it requires time and e...

Bibliographic Details
Main authors:	Cardoen, Ben; Ben Yedder, Hanene; Lee, Sieun; Nabi, Ivan Robert; Hamarneh, Ghassan
Format:	Online Article Text
Language:	English
Published:	Oxford University Press 2023
Subjects:	Application Note
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10290225/ https://www.ncbi.nlm.nih.gov/pubmed/37359728 http://dx.doi.org/10.1093/bioadv/vbad068

author	Cardoen, Ben; Ben Yedder, Hanene; Lee, Sieun; Nabi, Ivan Robert; Hamarneh, Ghassan
collection	PubMed
description	Large-scale processing of heterogeneous datasets in interdisciplinary research often requires time-consuming manual data curation. Ambiguity in the data layout and preprocessing conventions can easily compromise reproducibility and scientific discovery, and even when detected, correcting it requires time and effort from domain experts. Poor data curation can also interrupt processing jobs on large computing clusters, causing frustration and delays. We introduce DataCurator, a portable software package that verifies arbitrarily complex datasets of mixed formats and works equally well on clusters and on local systems. Human-readable TOML recipes are converted into executable, machine-verifiable templates, enabling users to verify datasets using custom rules without writing code. Recipes can be used to transform and validate data, for pre- or post-processing, selection of data subsets, sampling, and aggregation such as summary statistics. Processing pipelines no longer need to be burdened by laborious data validation: data curation and validation are replaced by human- and machine-verifiable recipes that specify rules and actions. Multithreaded execution ensures scalability on clusters, and existing Julia, R and Python libraries can be reused. DataCurator enables efficient remote workflows, offering integration with Slack and the ability to transfer curated data to clusters using OwnCloud and SCP. Code available at: https://github.com/bencardoen/DataCurator.jl.
format	Online Article Text
id	pubmed-10290225
institution	National Center for Biotechnology Information
language	English
publishDate	2023
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-10290225 2023-06-25 DataCurator.jl: efficient, portable and reproducible validation, curation and transformation of large heterogeneous datasets using human-readable recipes compiled into machine-verifiable templates. Cardoen, Ben; Ben Yedder, Hanene; Lee, Sieun; Nabi, Ivan Robert; Hamarneh, Ghassan. Bioinform Adv. Application Note. Oxford University Press 2023-06-01. /pmc/articles/PMC10290225/ /pubmed/37359728 http://dx.doi.org/10.1093/bioadv/vbad068 Text en. © The Author(s) 2023. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title	DataCurator.jl: efficient, portable and reproducible validation, curation and transformation of large heterogeneous datasets using human-readable recipes compiled into machine-verifiable templates
topic	Application Note
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10290225/ https://www.ncbi.nlm.nih.gov/pubmed/37359728 http://dx.doi.org/10.1093/bioadv/vbad068
