
Scaling bioinformatics applications on HPC

BACKGROUND: Recent breakthroughs in molecular biology and next-generation sequencing technologies have led to the exponential growth of sequence databases. Researchers use BLAST for processing these sequences. However, traditional software parallelization techniques (threads, message passing interf...

Full description

Bibliographic Details
Main Authors: Mikailov, Mike, Luo, Fu-Jyh, Barkley, Stuart, Valleru, Lohit, Whitney, Stephen, Liu, Zhichao, Thakkar, Shraddha, Tong, Weida, Petrick, Nicholas
Format: Online Article Text
Language: English
Published: BioMed Central 2017
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5751788/
https://www.ncbi.nlm.nih.gov/pubmed/29297287
http://dx.doi.org/10.1186/s12859-017-1902-7
_version_ 1783290018660352000
author Mikailov, Mike
Luo, Fu-Jyh
Barkley, Stuart
Valleru, Lohit
Whitney, Stephen
Liu, Zhichao
Thakkar, Shraddha
Tong, Weida
Petrick, Nicholas
author_facet Mikailov, Mike
Luo, Fu-Jyh
Barkley, Stuart
Valleru, Lohit
Whitney, Stephen
Liu, Zhichao
Thakkar, Shraddha
Tong, Weida
Petrick, Nicholas
author_sort Mikailov, Mike
collection PubMed
description BACKGROUND: Recent breakthroughs in molecular biology and next-generation sequencing technologies have led to the exponential growth of sequence databases. Researchers use BLAST for processing these sequences. However, traditional software parallelization techniques (threads, message passing interface) applied in newer versions of BLAST are not adequate for processing these sequences in a timely manner. METHODS: A new method for array job parallelization has been developed which offers O(T) theoretical speed-up in comparison to multi-threading and MPI techniques, where T is the number of array job tasks. (The total number of CPUs used to complete the job equals T multiplied by the number of CPUs used by a single task.) The approach is based on segmentation of both input datasets to the BLAST process, combining partial solutions published earlier (Dhanker and Gupta, Int J Comput Sci Inf Technol 5:4818-4820, 2014; Grant et al., Bioinformatics 18:765-766, 2002; Mathog, Bioinformatics 19:1865-1866, 2003). It is accordingly referred to as a “dual segmentation” method. To implement the new method, the BLAST source code was modified to allow the researcher to pass to the program the number of records (effective number of sequences) in the original database. The team also developed methods to manage and consolidate the large number of partial results produced. Dual segmentation allows for massive parallelization, which substantially raises the scaling ceiling. RESULTS: BLAST jobs that previously failed or ran inefficiently to completion now finish far faster, characteristically reducing wall-clock time from 27 days on 40 CPUs to a single day using 4104 tasks, each task utilizing eight CPUs and taking less than 7 minutes to complete. CONCLUSIONS: The massive increase in the number of tasks when running an analysis job with dual segmentation reduces the size, scope and execution time of each task. Besides significantly faster completion, additional benefits include fine-grained checkpointing and increased flexibility of job submission. “Trickling in” a swarm of small individual tasks tempers competition for CPU time in the shared HPC environment, and jobs submitted during quiet periods can complete in extraordinarily short time frames. The smaller task size also allows the use of older and less powerful hardware: the CDRH workhorse cluster was commissioned in 2010, yet its eight-core CPUs with only 24 GB RAM work well in 2017 for these dual segmentation jobs. Finally, these techniques are attractive to budget-conscious scientific research organizations, where probabilistic algorithms such as BLAST might discourage attempts at greater certainty because single runs represent a major resource drain. If a job that used to take 24 days can now be completed in less than an hour or on a space-available basis (as is the case at CDRH), repeated runs for more exhaustive analyses can be usefully contemplated.
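Illustrative sketch (not from the article): one way a dual-segmentation array job could map a single scheduler task index onto a (query segment, database segment) pair and run one small, independent BLAST search per task. The segment counts, file-naming scheme, and the SGE_TASK_ID environment variable read are assumptions for illustration only, not the authors' actual scripts; the chosen split (54 × 76) is simply one factorization that yields the 4104 tasks reported in the abstract.

```python
import os
import subprocess

# Hypothetical segment counts: Q query chunks x D database chunks = T array tasks.
# 54 * 76 = 4104 matches the task count reported in the abstract, but the real
# split used by the authors is not specified here.
Q_SEGMENTS = 54   # number of query-file segments (assumed)
D_SEGMENTS = 76   # number of database segments (assumed)


def task_to_pair(task_id: int) -> tuple[int, int]:
    """Map a 1-based array-task index onto a (query segment, database segment) pair."""
    idx = task_id - 1            # schedulers usually number array tasks from 1
    return idx // D_SEGMENTS, idx % D_SEGMENTS


def run_task(task_id: int) -> None:
    q, d = task_to_pair(task_id)
    query_chunk = f"query_part_{q:03d}.fasta"   # assumed naming scheme
    db_chunk = f"nr_part_{d:03d}"               # assumed pre-split BLAST database chunk
    out_file = f"results/task_{task_id:05d}.out"
    # One independent BLAST run per task; -num_threads 8 mirrors the
    # eight CPUs per task mentioned in the abstract.
    subprocess.run(
        ["blastp", "-query", query_chunk, "-db", db_chunk,
         "-num_threads", "8", "-out", out_file],
        check=True,
    )


if __name__ == "__main__":
    # Array schedulers expose the task index via an environment variable
    # (SGE_TASK_ID on Grid Engine; Slurm uses SLURM_ARRAY_TASK_ID).
    run_task(int(os.environ.get("SGE_TASK_ID", "1")))
```

Because each task touches only one query chunk and one database chunk, a failed task can be resubmitted on its own (the fine-grained checkpointing noted in the conclusions), and the many partial outputs are merged in a separate consolidation step once all tasks finish.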
format Online
Article
Text
id pubmed-5751788
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-5751788 2018-01-05 Scaling bioinformatics applications on HPC Mikailov, Mike Luo, Fu-Jyh Barkley, Stuart Valleru, Lohit Whitney, Stephen Liu, Zhichao Thakkar, Shraddha Tong, Weida Petrick, Nicholas BMC Bioinformatics Research BACKGROUND: Recent breakthroughs in molecular biology and next-generation sequencing technologies have led to the exponential growth of sequence databases. Researchers use BLAST for processing these sequences. However, traditional software parallelization techniques (threads, message passing interface) applied in newer versions of BLAST are not adequate for processing these sequences in a timely manner. METHODS: A new method for array job parallelization has been developed which offers O(T) theoretical speed-up in comparison to multi-threading and MPI techniques, where T is the number of array job tasks. (The total number of CPUs used to complete the job equals T multiplied by the number of CPUs used by a single task.) The approach is based on segmentation of both input datasets to the BLAST process, combining partial solutions published earlier (Dhanker and Gupta, Int J Comput Sci Inf Technol 5:4818-4820, 2014; Grant et al., Bioinformatics 18:765-766, 2002; Mathog, Bioinformatics 19:1865-1866, 2003). It is accordingly referred to as a “dual segmentation” method. To implement the new method, the BLAST source code was modified to allow the researcher to pass to the program the number of records (effective number of sequences) in the original database. The team also developed methods to manage and consolidate the large number of partial results produced. Dual segmentation allows for massive parallelization, which substantially raises the scaling ceiling. RESULTS: BLAST jobs that previously failed or ran inefficiently to completion now finish far faster, characteristically reducing wall-clock time from 27 days on 40 CPUs to a single day using 4104 tasks, each task utilizing eight CPUs and taking less than 7 minutes to complete. CONCLUSIONS: The massive increase in the number of tasks when running an analysis job with dual segmentation reduces the size, scope and execution time of each task. Besides significantly faster completion, additional benefits include fine-grained checkpointing and increased flexibility of job submission. “Trickling in” a swarm of small individual tasks tempers competition for CPU time in the shared HPC environment, and jobs submitted during quiet periods can complete in extraordinarily short time frames. The smaller task size also allows the use of older and less powerful hardware: the CDRH workhorse cluster was commissioned in 2010, yet its eight-core CPUs with only 24 GB RAM work well in 2017 for these dual segmentation jobs. Finally, these techniques are attractive to budget-conscious scientific research organizations, where probabilistic algorithms such as BLAST might discourage attempts at greater certainty because single runs represent a major resource drain. If a job that used to take 24 days can now be completed in less than an hour or on a space-available basis (as is the case at CDRH), repeated runs for more exhaustive analyses can be usefully contemplated. BioMed Central 2017-12-28 /pmc/articles/PMC5751788/ /pubmed/29297287 http://dx.doi.org/10.1186/s12859-017-1902-7 Text en © The Author(s). 
2017 Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle Research
Mikailov, Mike
Luo, Fu-Jyh
Barkley, Stuart
Valleru, Lohit
Whitney, Stephen
Liu, Zhichao
Thakkar, Shraddha
Tong, Weida
Petrick, Nicholas
Scaling bioinformatics applications on HPC
title Scaling bioinformatics applications on HPC
title_full Scaling bioinformatics applications on HPC
title_fullStr Scaling bioinformatics applications on HPC
title_full_unstemmed Scaling bioinformatics applications on HPC
title_short Scaling bioinformatics applications on HPC
title_sort scaling bioinformatics applications on hpc
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5751788/
https://www.ncbi.nlm.nih.gov/pubmed/29297287
http://dx.doi.org/10.1186/s12859-017-1902-7
work_keys_str_mv AT mikailovmike scalingbioinformaticsapplicationsonhpc
AT luofujyh scalingbioinformaticsapplicationsonhpc
AT barkleystuart scalingbioinformaticsapplicationsonhpc
AT vallerulohit scalingbioinformaticsapplicationsonhpc
AT whitneystephen scalingbioinformaticsapplicationsonhpc
AT liuzhichao scalingbioinformaticsapplicationsonhpc
AT thakkarshraddha scalingbioinformaticsapplicationsonhpc
AT tongweida scalingbioinformaticsapplicationsonhpc
AT petricknicholas scalingbioinformaticsapplicationsonhpc