Bioinformatics Application with Kubeflow for Batch Processing in Clouds
Bioinformatics pipelines make extensive use of HPC batch processing. The rapid growth of data volumes and computational complexity, especially for modern applications such as machine learning algorithms, imposes significant challenges to local HPC facilities. Many attempts have been made to burst HP...
Main authors: | Yuan, David Yu; Wildish, Tony |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | 2020 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7571545/ http://dx.doi.org/10.1007/978-3-030-59851-8_24 |
_version_ | 1783597189086314496 |
---|---|
author | Yuan, David Yu Wildish, Tony |
author_facet | Yuan, David Yu Wildish, Tony |
author_sort | Yuan, David Yu |
collection | PubMed |
description | Bioinformatics pipelines make extensive use of HPC batch processing. The rapid growth of data volumes and computational complexity, especially for modern applications such as machine learning algorithms, imposes significant challenges to local HPC facilities. Many attempts have been made to burst HPC batch processing into clouds with virtual machines. They all suffer from some common issues, for example: very high overhead, slow to scale up and slow to scale down, and nearly impossible to be cloud-agnostic. We have successfully deployed and run several pipelines on Kubernetes in OpenStack, Google Cloud Platform and Amazon Web Services. In particular, we use Kubeflow on top of Kubernetes for more sophisticated job scheduling, workflow management, and first class support for machine learning. We choose Kubeflow/Kubernetes to avoid the overhead of provisioning of virtual machines, to achieve rapid scaling with containers, and to be truly cloud-agnostic in all cloud environments. Kubeflow on Kubernetes also creates some new challenges in deployment, data access, performance monitoring, etc. We will discuss the details of these challenges and provide our solutions. We will demonstrate how our solutions work across all three very different clouds for both classical pipelines and new ones for machine learning. |
format | Online Article Text |
id | pubmed-7571545 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2020 |
record_format | MEDLINE/PubMed |
spelling | pubmed-7571545 2020-10-20 Bioinformatics Application with Kubeflow for Batch Processing in Clouds Yuan, David Yu Wildish, Tony High Performance Computing Article 2020-09-15 /pmc/articles/PMC7571545/ http://dx.doi.org/10.1007/978-3-030-59851-8_24 Text en © The Author(s) 2020 Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made. The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. |
spellingShingle | Article Yuan, David Yu Wildish, Tony Bioinformatics Application with Kubeflow for Batch Processing in Clouds |
title | Bioinformatics Application with Kubeflow for Batch Processing in Clouds |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7571545/ http://dx.doi.org/10.1007/978-3-030-59851-8_24 |