
Using Apache Spark on genome assembly for scalable overlap-graph reduction

BACKGROUND: De novo genome assembly builds the genome of a specimen from overlaps between genomic fragments, without relying on a reference sequence. Sequence fragments (called reads) are assembled into contigs and scaffolds based on their overlaps. The quality of the de novo assembly d...


Bibliographic Details
Main authors: Paul, Alexander J.; Lawrence, Dylan; Song, Myoungkyu; Lim, Seung-Hwan; Pan, Chongle; Ahn, Tae-Hyuk
Format: Online Article Text
Language: English
Published: BioMed Central 2019
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6805285/
https://www.ncbi.nlm.nih.gov/pubmed/31639049
http://dx.doi.org/10.1186/s40246-019-0227-1
author Paul, Alexander J.
Lawrence, Dylan
Song, Myoungkyu
Lim, Seung-Hwan
Pan, Chongle
Ahn, Tae-Hyuk
collection PubMed
description BACKGROUND: De novo genome assembly builds the genome of a specimen from overlaps between genomic fragments, without relying on a reference sequence. Sequence fragments (called reads) are assembled into contigs and scaffolds based on their overlaps. The quality of a de novo assembly depends on the length and continuity of the assembly. To enable faster and more accurate assembly of species, sequencing techniques such as high-throughput next-generation sequencing and long-read third-generation sequencing have been proposed. However, assembling data from these techniques requires a large amount of computer memory when very large overlap graphs are resolved, and the work is difficult to parallelize. RESULTS: To address these limitations, we propose an algorithmic approach called Scalable Overlap-graph Reduction Algorithms (SORA). SORA is an algorithm package that performs string graph reduction using Apache Spark. SORA's implementations are designed to execute de novo genome assembly on either a single machine or a distributed computing platform. SORA efficiently compacts the edges along very long graph paths by adapting the scalable features of the graph processing libraries provided by Apache Spark, GraphX and GraphFrames. CONCLUSIONS: The algorithms and experimental results are shared at our project website, https://github.com/BioHPC/SORA. We evaluated SORA with human genome samples. First, it processed a graph of nearly one billion edges on a distributed cloud cluster. Second, it processed mid-to-small size graphs on a single workstation within a short time frame. Overall, SORA achieved linear scaling in simulations with increasing numbers of computing instances.
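The description mentions compacting edges along long paths of an overlap graph. This record does not give SORA's actual algorithms, so the following is only a minimal single-machine sketch of one common reduction of that kind: collapsing each unbranched chain of overlaps into a single composite edge. It uses plain Python rather than Spark, GraphX, or GraphFrames, and the function name `compact_paths` and the `(source_read, target_read)` edge format are assumptions made for illustration.

```python
from collections import defaultdict


def compact_paths(edges):
    """Collapse each maximal unbranched chain of overlaps into one path.

    edges: iterable of (source_read, target_read) pairs (hypothetical format).
    Returns a list of paths; each path is a list of read ids and stands for
    one composite edge from its first read to its last read.
    """
    successors = defaultdict(list)
    in_degree = defaultdict(int)
    nodes = set()
    for src, dst in edges:
        successors[src].append(dst)
        in_degree[dst] += 1
        nodes.update((src, dst))

    def is_internal(node):
        # Exactly one edge in and one edge out: the chain passes straight through.
        return in_degree[node] == 1 and len(successors[node]) == 1

    paths = []
    for start in sorted(nodes):
        if is_internal(start):
            continue  # chains begin only at sources, sinks, or branching nodes
        for nxt in successors[start]:
            path = [start, nxt]
            # Extend through internal nodes until a branch or a sink is reached.
            while is_internal(path[-1]):
                path.append(successors[path[-1]][0])
            paths.append(path)
    # Note: isolated cycles made only of internal nodes are not reported here.
    return paths


if __name__ == "__main__":
    overlaps = [("a", "b"), ("b", "c"), ("c", "d"), ("c", "e")]
    for path in compact_paths(overlaps):
        print(" -> ".join(path))
```

In this toy input, four overlap edges reduce to three composite edges, because the chain through read `b` is merged. A distributed version would have to express the same idea with message passing or joins instead of in-memory dictionaries.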
format Online
Article
Text
id pubmed-6805285
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-6805285 2019-10-24 Using Apache Spark on genome assembly for scalable overlap-graph reduction Paul, Alexander J.; Lawrence, Dylan; Song, Myoungkyu; Lim, Seung-Hwan; Pan, Chongle; Ahn, Tae-Hyuk Hum Genomics Research
BioMed Central 2019-10-22 /pmc/articles/PMC6805285/ /pubmed/31639049 http://dx.doi.org/10.1186/s40246-019-0227-1 Text en © Paul et al. 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
title Using Apache Spark on genome assembly for scalable overlap-graph reduction
topic Research