
A lightweight, flow-based toolkit for parallel and distributed bioinformatics pipelines

BACKGROUND: Bioinformatic analyses typically proceed as chains of data-processing tasks. A pipeline, or 'workflow', is a well-defined protocol, with a specific structure defined by the topology of data-flow interdependencies, and a particular functionality arising from the data transformat...


Bibliographic Details
Main authors: Cieślik, Marcin; Mura, Cameron
Format: Text
Language: English
Published: BioMed Central 2011
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3051902/
https://www.ncbi.nlm.nih.gov/pubmed/21352538
http://dx.doi.org/10.1186/1471-2105-12-61
_version_ 1782199579372945408
author Cieślik, Marcin
Mura, Cameron
author_facet Cieślik, Marcin
Mura, Cameron
author_sort Cieślik, Marcin
collection PubMed
description BACKGROUND: Bioinformatic analyses typically proceed as chains of data-processing tasks. A pipeline, or 'workflow', is a well-defined protocol, with a specific structure defined by the topology of data-flow interdependencies, and a particular functionality arising from the data transformations applied at each step. In computer science, the dataflow programming (DFP) paradigm defines software systems constructed in this manner, as networks of message-passing components. Thus, bioinformatic workflows can be naturally mapped onto DFP concepts. RESULTS: To enable the flexible creation and execution of bioinformatics dataflows, we have written a modular framework for parallel pipelines in Python ('PaPy'). A PaPy workflow is created from re-usable components connected by data-pipes into a directed acyclic graph, which together define nested higher-order map functions. The successive functional transformations of input data are evaluated on flexibly pooled compute resources, either local or remote. Input items are processed in batches of adjustable size, allowing one to tune the trade-off between parallelism and lazy-evaluation (memory consumption). An add-on module ('NuBio') facilitates the creation of bioinformatics workflows by providing domain-specific data-containers (e.g., for biomolecular sequences, alignments, structures) and functionality (e.g., to parse/write standard file formats). CONCLUSIONS: PaPy offers a modular framework for the creation and deployment of parallel and distributed data-processing workflows. Pipelines derive their functionality from user-written, data-coupled components, so PaPy also can be viewed as a lightweight toolkit for extensible, flow-based bioinformatics data-processing. The simplicity and flexibility of distributed PaPy pipelines may help users bridge the gap between traditional desktop/workstation and grid computing. PaPy is freely distributed as open-source Python code at http://muralab.org/PaPy, and includes extensive documentation and annotated usage examples.
format Text
id pubmed-3051902
institution National Center for Biotechnology Information
language English
publishDate 2011
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-3051902 2011-04-04 A lightweight, flow-based toolkit for parallel and distributed bioinformatics pipelines Cieślik, Marcin Mura, Cameron BMC Bioinformatics Software BACKGROUND: Bioinformatic analyses typically proceed as chains of data-processing tasks. A pipeline, or 'workflow', is a well-defined protocol, with a specific structure defined by the topology of data-flow interdependencies, and a particular functionality arising from the data transformations applied at each step. In computer science, the dataflow programming (DFP) paradigm defines software systems constructed in this manner, as networks of message-passing components. Thus, bioinformatic workflows can be naturally mapped onto DFP concepts. RESULTS: To enable the flexible creation and execution of bioinformatics dataflows, we have written a modular framework for parallel pipelines in Python ('PaPy'). A PaPy workflow is created from re-usable components connected by data-pipes into a directed acyclic graph, which together define nested higher-order map functions. The successive functional transformations of input data are evaluated on flexibly pooled compute resources, either local or remote. Input items are processed in batches of adjustable size, allowing one to tune the trade-off between parallelism and lazy-evaluation (memory consumption). An add-on module ('NuBio') facilitates the creation of bioinformatics workflows by providing domain-specific data-containers (e.g., for biomolecular sequences, alignments, structures) and functionality (e.g., to parse/write standard file formats). CONCLUSIONS: PaPy offers a modular framework for the creation and deployment of parallel and distributed data-processing workflows. Pipelines derive their functionality from user-written, data-coupled components, so PaPy also can be viewed as a lightweight toolkit for extensible, flow-based bioinformatics data-processing. The simplicity and flexibility of distributed PaPy pipelines may help users bridge the gap between traditional desktop/workstation and grid computing. PaPy is freely distributed as open-source Python code at http://muralab.org/PaPy, and includes extensive documentation and annotated usage examples. BioMed Central 2011-02-25 /pmc/articles/PMC3051902/ /pubmed/21352538 http://dx.doi.org/10.1186/1471-2105-12-61 Text en Copyright © 2011 Cieślik and Mura; licensee BioMed Central Ltd. https://creativecommons.org/licenses/by/2.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Software
Cieślik, Marcin
Mura, Cameron
A lightweight, flow-based toolkit for parallel and distributed bioinformatics pipelines
title A lightweight, flow-based toolkit for parallel and distributed bioinformatics pipelines
topic Software
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3051902/
https://www.ncbi.nlm.nih.gov/pubmed/21352538
http://dx.doi.org/10.1186/1471-2105-12-61
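
The description above says that pipeline stages act as nested higher-order map functions and that inputs are processed in batches of adjustable size, trading parallelism against lazy evaluation. The following is a minimal sketch of that idea using only the Python standard library. It does not use PaPy's API: the names `batched_map`, `reverse_complement`, `gc_fraction`, and `run_pipeline`, and the choice of a thread pool, are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def batched_map(func, items, pool, batch_size):
    """Lazily apply func to items, evaluating one batch at a time in pool.

    Hypothetical helper, not part of PaPy. A larger batch_size gives more
    parallelism; a smaller one keeps fewer items in memory at once.
    """
    iterator = iter(items)
    while True:
        # Pull at most batch_size items from the upstream iterator.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        # Items in the batch run concurrently; results keep input order.
        yield from pool.map(func, batch)


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(seq))


def gc_fraction(seq):
    """Return the fraction of G and C bases in a non-empty sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def run_pipeline(sequences, batch_size=2, workers=2):
    """Chain two stages; each stage is a batched map over the previous one."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        stage1 = batched_map(reverse_complement, sequences, pool, batch_size)
        stage2 = batched_map(gc_fraction, stage1, pool, batch_size)
        # Consume the final stage while the pool is still open.
        return list(stage2)
```

Calling `run_pipeline(["AACG", "GGCC", "ATAT"])` returns `[0.5, 1.0, 0.0]`. Because each stage is a generator, nothing is computed until the final stage is consumed, and each stage holds at most one batch of inputs at a time.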