Developing and reusing bioinformatics data analysis pipelines using scientific workflow systems
Data analysis pipelines are now established as an effective means for specifying and executing bioinformatics data analysis and experiments. While scripting languages, particularly Python, R and notebooks, are popular and sufficient for developing small-scale pipelines that are often intended for a...
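Main authors: | Djaffardjy, Marine; Marchment, George; Sebe, Clémence; Blanchet, Raphael; Bellajhame, Khalid; Gaignard, Alban; Lemoine, Frédéric; Cohen-Boulakia, Sarah |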
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Research Network of Computational and Structural Biotechnology, 2023 |
Subjects: | Review Article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10030817/ https://www.ncbi.nlm.nih.gov/pubmed/36968012 http://dx.doi.org/10.1016/j.csbj.2023.03.003 |
_version_ | 1784910460382871552 |
---|---|
author | Djaffardjy, Marine; Marchment, George; Sebe, Clémence; Blanchet, Raphael; Bellajhame, Khalid; Gaignard, Alban; Lemoine, Frédéric; Cohen-Boulakia, Sarah |
author_facet | Djaffardjy, Marine; Marchment, George; Sebe, Clémence; Blanchet, Raphael; Bellajhame, Khalid; Gaignard, Alban; Lemoine, Frédéric; Cohen-Boulakia, Sarah |
author_sort | Djaffardjy, Marine |
collection | PubMed |
description | Data analysis pipelines are now established as an effective means for specifying and executing bioinformatics data analysis and experiments. While scripting languages, particularly Python, R and notebooks, are popular and sufficient for developing small-scale pipelines that are often intended for a single user, it is now widely recognized that they are by no means enough to support the development of large-scale, shareable, maintainable and reusable pipelines capable of handling large volumes of data and running on high performance computing clusters. This review outlines the key requirements for building large-scale data pipelines and provides a mapping of existing solutions that fulfill them. We then highlight the benefits of using scientific workflow systems to get modular, reproducible and reusable bioinformatics data analysis pipelines. We finally discuss current workflow reuse practices based on an empirical study we performed on a large collection of workflows. |
format | Online Article Text |
id | pubmed-10030817 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2023 |
publisher | Research Network of Computational and Structural Biotechnology |
record_format | MEDLINE/PubMed |
spelling | pubmed-10030817 2023-03-23 Developing and reusing bioinformatics data analysis pipelines using scientific workflow systems Djaffardjy, Marine Marchment, George Sebe, Clémence Blanchet, Raphael Bellajhame, Khalid Gaignard, Alban Lemoine, Frédéric Cohen-Boulakia, Sarah Comput Struct Biotechnol J Review Article Data analysis pipelines are now established as an effective means for specifying and executing bioinformatics data analysis and experiments. While scripting languages, particularly Python, R and notebooks, are popular and sufficient for developing small-scale pipelines that are often intended for a single user, it is now widely recognized that they are by no means enough to support the development of large-scale, shareable, maintainable and reusable pipelines capable of handling large volumes of data and running on high performance computing clusters. This review outlines the key requirements for building large-scale data pipelines and provides a mapping of existing solutions that fulfill them. We then highlight the benefits of using scientific workflow systems to get modular, reproducible and reusable bioinformatics data analysis pipelines. We finally discuss current workflow reuse practices based on an empirical study we performed on a large collection of workflows. Research Network of Computational and Structural Biotechnology 2023-03-07 /pmc/articles/PMC10030817/ /pubmed/36968012 http://dx.doi.org/10.1016/j.csbj.2023.03.003 Text en © 2023 Published by Elsevier B.V. on behalf of Research Network of Computational and Structural Biotechnology. https://creativecommons.org/licenses/by-nc-nd/4.0/ This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/). |
title | Developing and reusing bioinformatics data analysis pipelines using scientific workflow systems |
topic | Review Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10030817/ https://www.ncbi.nlm.nih.gov/pubmed/36968012 http://dx.doi.org/10.1016/j.csbj.2023.03.003 |