Halvade: scalable sequence analysis with MapReduce
Motivation: Post-sequencing DNA analysis typically consists of read mapping followed by variant calling. Especially for whole genome sequencing, this computational step is very time-consuming, even when using multithreading on a multi-core machine. Results: We present Halvade, a framework that enabl...
Main authors: | Decap, Dries; Reumers, Joke; Herzeel, Charlotte; Costanza, Pascal; Fostier, Jan |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press, 2015 |
Subjects: | Original Papers |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4514927/ https://www.ncbi.nlm.nih.gov/pubmed/25819078 http://dx.doi.org/10.1093/bioinformatics/btv179 |
_version_ | 1782382839598153728 |
---|---|
author | Decap, Dries Reumers, Joke Herzeel, Charlotte Costanza, Pascal Fostier, Jan |
author_facet | Decap, Dries Reumers, Joke Herzeel, Charlotte Costanza, Pascal Fostier, Jan |
author_sort | Decap, Dries |
collection | PubMed |
description | Motivation: Post-sequencing DNA analysis typically consists of read mapping followed by variant calling. Especially for whole genome sequencing, this computational step is very time-consuming, even when using multithreading on a multi-core machine. Results: We present Halvade, a framework that enables sequencing pipelines to be executed in parallel on a multi-node and/or multi-core compute infrastructure in a highly efficient manner. As an example, a DNA sequencing analysis pipeline for variant calling has been implemented according to the GATK Best Practices recommendations, supporting both whole genome and whole exome sequencing. Using a 15-node computer cluster with 360 CPU cores in total, Halvade processes the NA12878 dataset (human, 100 bp paired-end reads, 50× coverage) in <3 h with very high parallel efficiency. Even on a single, multi-core machine, Halvade attains a significant speedup compared with running the individual tools with multithreading. Availability and implementation: Halvade is written in Java and uses the Hadoop MapReduce 2.0 API. It supports a wide range of distributions of Hadoop, including Cloudera and Amazon EMR. Its source is available at http://bioinformatics.intec.ugent.be/halvade under GPL license. Contact: jan.fostier@intec.ugent.be Supplementary information: Supplementary data are available at Bioinformatics online. |
format | Online Article Text |
id | pubmed-4514927 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2015 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-45149272015-07-27 Halvade: scalable sequence analysis with MapReduce Decap, Dries Reumers, Joke Herzeel, Charlotte Costanza, Pascal Fostier, Jan Bioinformatics Original Papers Motivation: Post-sequencing DNA analysis typically consists of read mapping followed by variant calling. Especially for whole genome sequencing, this computational step is very time-consuming, even when using multithreading on a multi-core machine. Results: We present Halvade, a framework that enables sequencing pipelines to be executed in parallel on a multi-node and/or multi-core compute infrastructure in a highly efficient manner. As an example, a DNA sequencing analysis pipeline for variant calling has been implemented according to the GATK Best Practices recommendations, supporting both whole genome and whole exome sequencing. Using a 15-node computer cluster with 360 CPU cores in total, Halvade processes the NA12878 dataset (human, 100 bp paired-end reads, 50× coverage) in <3 h with very high parallel efficiency. Even on a single, multi-core machine, Halvade attains a significant speedup compared with running the individual tools with multithreading. Availability and implementation: Halvade is written in Java and uses the Hadoop MapReduce 2.0 API. It supports a wide range of distributions of Hadoop, including Cloudera and Amazon EMR. Its source is available at http://bioinformatics.intec.ugent.be/halvade under GPL license. Contact: jan.fostier@intec.ugent.be Supplementary information: Supplementary data are available at Bioinformatics online. Oxford University Press 2015-08-01 2015-03-26 /pmc/articles/PMC4514927/ /pubmed/25819078 http://dx.doi.org/10.1093/bioinformatics/btv179 Text en © The Author 2015. Published by Oxford University Press. 
http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com |
spellingShingle | Original Papers Decap, Dries Reumers, Joke Herzeel, Charlotte Costanza, Pascal Fostier, Jan Halvade: scalable sequence analysis with MapReduce |
title | Halvade: scalable sequence analysis with MapReduce |
title_full | Halvade: scalable sequence analysis with MapReduce |
title_fullStr | Halvade: scalable sequence analysis with MapReduce |
title_full_unstemmed | Halvade: scalable sequence analysis with MapReduce |
title_short | Halvade: scalable sequence analysis with MapReduce |
title_sort | halvade: scalable sequence analysis with mapreduce |
topic | Original Papers |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4514927/ https://www.ncbi.nlm.nih.gov/pubmed/25819078 http://dx.doi.org/10.1093/bioinformatics/btv179 |
work_keys_str_mv | AT decapdries halvadescalablesequenceanalysiswithmapreduce AT reumersjoke halvadescalablesequenceanalysiswithmapreduce AT herzeelcharlotte halvadescalablesequenceanalysiswithmapreduce AT costanzapascal halvadescalablesequenceanalysiswithmapreduce AT fostierjan halvadescalablesequenceanalysiswithmapreduce |