Developing and reusing bioinformatics data analysis pipelines using scientific workflow systems

Data analysis pipelines are now established as an effective means for specifying and executing bioinformatics data analysis and experiments. While scripting languages, particularly Python, R and notebooks, are popular and sufficient for developing small-scale pipelines that are often intended for a...

Bibliographic Details
Main authors: Djaffardjy, Marine, Marchment, George, Sebe, Clémence, Blanchet, Raphael, Bellajhame, Khalid, Gaignard, Alban, Lemoine, Frédéric, Cohen-Boulakia, Sarah
Format: Online Article Text
Language: English
Published: Research Network of Computational and Structural Biotechnology 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10030817/
https://www.ncbi.nlm.nih.gov/pubmed/36968012
http://dx.doi.org/10.1016/j.csbj.2023.03.003
_version_ 1784910460382871552
author Djaffardjy, Marine
Marchment, George
Sebe, Clémence
Blanchet, Raphael
Bellajhame, Khalid
Gaignard, Alban
Lemoine, Frédéric
Cohen-Boulakia, Sarah
author_facet Djaffardjy, Marine
Marchment, George
Sebe, Clémence
Blanchet, Raphael
Bellajhame, Khalid
Gaignard, Alban
Lemoine, Frédéric
Cohen-Boulakia, Sarah
author_sort Djaffardjy, Marine
collection PubMed
description Data analysis pipelines are now established as an effective means for specifying and executing bioinformatics data analysis and experiments. While scripting languages, particularly Python, R and notebooks, are popular and sufficient for developing small-scale pipelines that are often intended for a single user, it is now widely recognized that they are by no means enough to support the development of large-scale, shareable, maintainable and reusable pipelines capable of handling large volumes of data and running on high performance computing clusters. This review outlines the key requirements for building large-scale data pipelines and provides a mapping of existing solutions that fulfill them. We then highlight the benefits of using scientific workflow systems to get modular, reproducible and reusable bioinformatics data analysis pipelines. We finally discuss current workflow reuse practices based on an empirical study we performed on a large collection of workflows.
format Online
Article
Text
id pubmed-10030817
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Research Network of Computational and Structural Biotechnology
record_format MEDLINE/PubMed
spelling pubmed-10030817 2023-03-23 Developing and reusing bioinformatics data analysis pipelines using scientific workflow systems Djaffardjy, Marine Marchment, George Sebe, Clémence Blanchet, Raphael Bellajhame, Khalid Gaignard, Alban Lemoine, Frédéric Cohen-Boulakia, Sarah Comput Struct Biotechnol J Review Article Data analysis pipelines are now established as an effective means for specifying and executing bioinformatics data analysis and experiments. While scripting languages, particularly Python, R and notebooks, are popular and sufficient for developing small-scale pipelines that are often intended for a single user, it is now widely recognized that they are by no means enough to support the development of large-scale, shareable, maintainable and reusable pipelines capable of handling large volumes of data and running on high performance computing clusters. This review outlines the key requirements for building large-scale data pipelines and provides a mapping of existing solutions that fulfill them. We then highlight the benefits of using scientific workflow systems to get modular, reproducible and reusable bioinformatics data analysis pipelines. We finally discuss current workflow reuse practices based on an empirical study we performed on a large collection of workflows. Research Network of Computational and Structural Biotechnology 2023-03-07 /pmc/articles/PMC10030817/ /pubmed/36968012 http://dx.doi.org/10.1016/j.csbj.2023.03.003 Text en © 2023 Published by Elsevier B.V. on behalf of Research Network of Computational and Structural Biotechnology. https://creativecommons.org/licenses/by-nc-nd/4.0/ This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
spellingShingle Review Article
Djaffardjy, Marine
Marchment, George
Sebe, Clémence
Blanchet, Raphael
Bellajhame, Khalid
Gaignard, Alban
Lemoine, Frédéric
Cohen-Boulakia, Sarah
Developing and reusing bioinformatics data analysis pipelines using scientific workflow systems
title Developing and reusing bioinformatics data analysis pipelines using scientific workflow systems
title_full Developing and reusing bioinformatics data analysis pipelines using scientific workflow systems
title_fullStr Developing and reusing bioinformatics data analysis pipelines using scientific workflow systems
title_full_unstemmed Developing and reusing bioinformatics data analysis pipelines using scientific workflow systems
title_short Developing and reusing bioinformatics data analysis pipelines using scientific workflow systems
title_sort developing and reusing bioinformatics data analysis pipelines using scientific workflow systems
topic Review Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10030817/
https://www.ncbi.nlm.nih.gov/pubmed/36968012
http://dx.doi.org/10.1016/j.csbj.2023.03.003
work_keys_str_mv AT djaffardjymarine developingandreusingbioinformaticsdataanalysispipelinesusingscientificworkflowsystems
AT marchmentgeorge developingandreusingbioinformaticsdataanalysispipelinesusingscientificworkflowsystems
AT sebeclemence developingandreusingbioinformaticsdataanalysispipelinesusingscientificworkflowsystems
AT blanchetraphael developingandreusingbioinformaticsdataanalysispipelinesusingscientificworkflowsystems
AT bellajhamekhalid developingandreusingbioinformaticsdataanalysispipelinesusingscientificworkflowsystems
AT gaignardalban developingandreusingbioinformaticsdataanalysispipelinesusingscientificworkflowsystems
AT lemoinefrederic developingandreusingbioinformaticsdataanalysispipelinesusingscientificworkflowsystems
AT cohenboulakiasarah developingandreusingbioinformaticsdataanalysispipelinesusingscientificworkflowsystems