DagSim: Combining DAG-based model structure with unconstrained data types and relations for flexible, transparent, and modularized data simulation

Data simulation is fundamental for machine learning and causal inference, as it allows exploration of scenarios and assessment of methods in settings with full control of ground truth. Directed acyclic graphs (DAGs) are well established for encoding the dependence structure over a collection of vari...

Bibliographic Details
Main Authors: Al Hajj, Ghadi S., Pensar, Johan, Sandve, Geir K.
Format: Online Article Text
Language: English
Published: Public Library of Science 2023
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10104342/
https://www.ncbi.nlm.nih.gov/pubmed/37058511
http://dx.doi.org/10.1371/journal.pone.0284443
_version_ 1785026021487017984
author Al Hajj, Ghadi S.
Pensar, Johan
Sandve, Geir K.
collection PubMed
description Data simulation is fundamental for machine learning and causal inference, as it allows exploration of scenarios and assessment of methods in settings with full control of ground truth. Directed acyclic graphs (DAGs) are well established for encoding the dependence structure over a collection of variables in both inference and simulation settings. However, while modern machine learning is applied to data of an increasingly complex nature, DAG-based simulation frameworks are still confined to settings with relatively simple variable types and functional forms. We here present DagSim, a Python-based framework for DAG-based data simulation without any constraints on variable types or functional relations. A succinct YAML format for defining the simulation model structure promotes transparency, while separate user-provided functions for generating each variable based on its parents ensure simulation code modularization. We illustrate the capabilities of DagSim through use cases where metadata variables control shapes in an image and patterns in bio-sequences. DagSim is available as a Python package at PyPI. Source code and documentation are available at: https://github.com/uio-bmi/dagsim
format Online
Article
Text
id pubmed-10104342
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-10104342 2023-04-15 DagSim: Combining DAG-based model structure with unconstrained data types and relations for flexible, transparent, and modularized data simulation Al Hajj, Ghadi S. Pensar, Johan Sandve, Geir K. PLoS One Research Article Data simulation is fundamental for machine learning and causal inference, as it allows exploration of scenarios and assessment of methods in settings with full control of ground truth. Directed acyclic graphs (DAGs) are well established for encoding the dependence structure over a collection of variables in both inference and simulation settings. However, while modern machine learning is applied to data of an increasingly complex nature, DAG-based simulation frameworks are still confined to settings with relatively simple variable types and functional forms. We here present DagSim, a Python-based framework for DAG-based data simulation without any constraints on variable types or functional relations. A succinct YAML format for defining the simulation model structure promotes transparency, while separate user-provided functions for generating each variable based on its parents ensure simulation code modularization. We illustrate the capabilities of DagSim through use cases where metadata variables control shapes in an image and patterns in bio-sequences. DagSim is available as a Python package at PyPI. Source code and documentation are available at: https://github.com/uio-bmi/dagsim Public Library of Science 2023-04-14 /pmc/articles/PMC10104342/ /pubmed/37058511 http://dx.doi.org/10.1371/journal.pone.0284443 Text en © 2023 Al Hajj et al https://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
title DagSim: Combining DAG-based model structure with unconstrained data types and relations for flexible, transparent, and modularized data simulation
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10104342/
https://www.ncbi.nlm.nih.gov/pubmed/37058511
http://dx.doi.org/10.1371/journal.pone.0284443
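
The description in this record says each variable is generated by a separate user-provided function of its parents in a DAG. The sketch below illustrates that general idea (ancestral sampling in topological order) using only the Python standard library. It is not DagSim's API and does not use its YAML format: the `MODEL` layout, the `simulate` helper, the `MOTIF` value, and the three example variables are all hypothetical names chosen for illustration, loosely following the bio-sequence use case mentioned in the abstract.

```python
import random
from graphlib import TopologicalSorter  # standard library, Python 3.9+

MOTIF = "GATTACA"  # hypothetical pattern implanted into positive sequences


def sample_age(rng):
    # Root node: no parents.
    return rng.randint(20, 80)


def sample_label(rng, age):
    # Binary metadata variable; the age dependence is illustrative only.
    return int(rng.random() < age / 100)


def sample_sequence(rng, label, length=12):
    # Non-scalar variable: a nucleotide string whose pattern depends on its parent.
    seq = [rng.choice("ACGT") for _ in range(length)]
    if label == 1:
        start = rng.randrange(length - len(MOTIF) + 1)
        seq[start:start + len(MOTIF)] = list(MOTIF)
    return "".join(seq)


# Hypothetical model layout: node name -> (parent names, generating function).
MODEL = {
    "age": ([], sample_age),
    "label": (["age"], sample_label),
    "sequence": (["label"], sample_sequence),
}


def simulate(model, num_samples, seed=0):
    """Draw samples by visiting nodes so that parents come before children."""
    rng = random.Random(seed)
    parents_of = {name: parents for name, (parents, _) in model.items()}
    # static_order() raises graphlib.CycleError if the graph is not acyclic.
    order = list(TopologicalSorter(parents_of).static_order())
    samples = []
    for _ in range(num_samples):
        values = {}
        for name in order:
            parents, fn = model[name]
            values[name] = fn(rng, *(values[p] for p in parents))
        samples.append(values)
    return samples


samples = simulate(MODEL, num_samples=50, seed=1)
```

Because each generating function only sees its parents' values, a node can return any Python object (here a string), and swapping one function does not affect the others. For the actual DagSim interface, consult the documentation linked in the description field.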