Bioinformatics Application with Kubeflow for Batch Processing in Clouds

Bioinformatics pipelines make extensive use of HPC batch processing. The rapid growth of data volumes and computational complexity, especially for modern applications such as machine learning algorithms, imposes significant challenges to local HPC facilities. Many attempts have been made to burst HP...

Bibliographic Details
Main authors: Yuan, David Yu, Wildish, Tony
Format: Online Article Text
Language: English
Published: 2020
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7571545/
http://dx.doi.org/10.1007/978-3-030-59851-8_24
_version_ 1783597189086314496
author Yuan, David Yu
Wildish, Tony
author_facet Yuan, David Yu
Wildish, Tony
author_sort Yuan, David Yu
collection PubMed
description Bioinformatics pipelines make extensive use of HPC batch processing. The rapid growth of data volumes and computational complexity, especially for modern applications such as machine learning algorithms, imposes significant challenges to local HPC facilities. Many attempts have been made to burst HPC batch processing into clouds with virtual machines. They all suffer from some common issues, for example: very high overhead, slow to scale up and slow to scale down, and nearly impossible to be cloud-agnostic. We have successfully deployed and run several pipelines on Kubernetes in OpenStack, Google Cloud Platform and Amazon Web Services. In particular, we use Kubeflow on top of Kubernetes for more sophisticated job scheduling, workflow management, and first class support for machine learning. We choose Kubeflow/Kubernetes to avoid the overhead of provisioning of virtual machines, to achieve rapid scaling with containers, and to be truly cloud-agnostic in all cloud environments. Kubeflow on Kubernetes also creates some new challenges in deployment, data access, performance monitoring, etc. We will discuss the details of these challenges and provide our solutions. We will demonstrate how our solutions work across all three very different clouds for both classical pipelines and new ones for machine learning.
format Online
Article
Text
id pubmed-7571545
institution National Center for Biotechnology Information
language English
publishDate 2020
record_format MEDLINE/PubMed
spelling pubmed-75715452020-10-20 Bioinformatics Application with Kubeflow for Batch Processing in Clouds Yuan, David Yu Wildish, Tony High Performance Computing Article Bioinformatics pipelines make extensive use of HPC batch processing. The rapid growth of data volumes and computational complexity, especially for modern applications such as machine learning algorithms, imposes significant challenges to local HPC facilities. Many attempts have been made to burst HPC batch processing into clouds with virtual machines. They all suffer from some common issues, for example: very high overhead, slow to scale up and slow to scale down, and nearly impossible to be cloud-agnostic. We have successfully deployed and run several pipelines on Kubernetes in OpenStack, Google Cloud Platform and Amazon Web Services. In particular, we use Kubeflow on top of Kubernetes for more sophisticated job scheduling, workflow management, and first class support for machine learning. We choose Kubeflow/Kubernetes to avoid the overhead of provisioning of virtual machines, to achieve rapid scaling with containers, and to be truly cloud-agnostic in all cloud environments. Kubeflow on Kubernetes also creates some new challenges in deployment, data access, performance monitoring, etc. We will discuss the details of these challenges and provide our solutions. We will demonstrate how our solutions work across all three very different clouds for both classical pipelines and new ones for machine learning. 
2020-09-15 /pmc/articles/PMC7571545/ http://dx.doi.org/10.1007/978-3-030-59851-8_24 Text en © The Author(s) 2020 Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made. The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
spellingShingle Article
Yuan, David Yu
Wildish, Tony
Bioinformatics Application with Kubeflow for Batch Processing in Clouds
title Bioinformatics Application with Kubeflow for Batch Processing in Clouds
title_full Bioinformatics Application with Kubeflow for Batch Processing in Clouds
title_fullStr Bioinformatics Application with Kubeflow for Batch Processing in Clouds
title_full_unstemmed Bioinformatics Application with Kubeflow for Batch Processing in Clouds
title_short Bioinformatics Application with Kubeflow for Batch Processing in Clouds
title_sort bioinformatics application with kubeflow for batch processing in clouds
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7571545/
http://dx.doi.org/10.1007/978-3-030-59851-8_24
work_keys_str_mv AT yuandavidyu bioinformaticsapplicationwithkubeflowforbatchprocessinginclouds
AT wildishtony bioinformaticsapplicationwithkubeflowforbatchprocessinginclouds