
doepipeline: a systematic approach to optimizing multi-level and multi-step data processing workflows

BACKGROUND: Selecting the proper parameter settings for bioinformatic software tools is challenging. Not only will each parameter have an individual effect on the outcome, but there are also potential interaction effects between parameters. Both of these effects may be difficult to predict. To make...


Bibliographic Details
Main Authors:	Svensson, Daniel, Sjögren, Rickard, Sundell, David, Sjödin, Andreas, Trygg, Johan
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2019
Subjects:	Methodology Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6794737/ https://www.ncbi.nlm.nih.gov/pubmed/31615395 http://dx.doi.org/10.1186/s12859-019-3091-z

author	Svensson, Daniel Sjögren, Rickard Sundell, David Sjödin, Andreas Trygg, Johan
collection	PubMed
description	BACKGROUND: Selecting the proper parameter settings for bioinformatic software tools is challenging. Not only will each parameter have an individual effect on the outcome, but there are also potential interaction effects between parameters. Both of these effects may be difficult to predict. To make the situation even more complex, multiple tools may be run in a sequential pipeline where the final output depends on the parameter configuration for each tool in the pipeline. Because of the complexity and difficulty of predicting outcomes, in practice parameters are often left at default settings or set based on personal or peer experience obtained in a trial-and-error fashion. To allow for the reliable and efficient selection of parameters for bioinformatic pipelines, a systematic approach is needed. RESULTS: We present doepipeline, a novel approach to optimizing bioinformatic software parameters, based on core concepts of the Design of Experiments methodology and recent advances in subset designs. Optimal parameter settings are first approximated in a screening phase using a subset design that efficiently spans the entire search space, then optimized in the subsequent phase using response surface designs and OLS modeling. Doepipeline was used to optimize parameters in four use cases: 1) de novo assembly, 2) scaffolding of a fragmented genome assembly, 3) k-mer taxonomic classification of Oxford Nanopore Technologies MinION reads, and 4) genetic variant calling. In all four cases, doepipeline found parameter settings that produced a better outcome with respect to the characteristic measured when compared to using default values. Our approach is implemented and available in the Python package doepipeline. CONCLUSIONS: Our proposed methodology provides a systematic and robust framework for optimizing software parameter settings, in contrast to labor- and time-intensive manual parameter tweaking. Implementation in doepipeline makes our methodology accessible and user-friendly, and allows for automatic optimization of tools in a wide range of cases. The source code of doepipeline is available at https://github.com/clicumu/doepipeline and it can be installed through conda-forge.
format	Online Article Text
id	pubmed-6794737
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-6794737 2019-10-21 doepipeline: a systematic approach to optimizing multi-level and multi-step data processing workflows Svensson, Daniel Sjögren, Rickard Sundell, David Sjödin, Andreas Trygg, Johan BMC Bioinformatics Methodology Article BioMed Central 2019-10-15 /pmc/articles/PMC6794737/ /pubmed/31615395 http://dx.doi.org/10.1186/s12859-019-3091-z Text en © The Author(s). 2019 Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title	doepipeline: a systematic approach to optimizing multi-level and multi-step data processing workflows
topic	Methodology Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6794737/ https://www.ncbi.nlm.nih.gov/pubmed/31615395 http://dx.doi.org/10.1186/s12859-019-3091-z
