Using Apache Spark on genome assembly for scalable overlap-graph reduction
BACKGROUND: De novo genome assembly builds the genome of a specimen from the overlaps of genomic fragments, without relying on a reference sequence. Sequence fragments (called reads) are assembled into contigs and scaffolds based on these overlaps. The quality of the de novo assembly d...
Main Authors: | Paul, Alexander J.; Lawrence, Dylan; Song, Myoungkyu; Lim, Seung-Hwan; Pan, Chongle; Ahn, Tae-Hyuk |
---|---|
Format: | Online Article Text |
Language: | English |
Published: |
BioMed Central
2019
|
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6805285/ https://www.ncbi.nlm.nih.gov/pubmed/31639049 http://dx.doi.org/10.1186/s40246-019-0227-1 |
_version_ | 1783461346331852800 |
---|---|
author | Paul, Alexander J.; Lawrence, Dylan; Song, Myoungkyu; Lim, Seung-Hwan; Pan, Chongle; Ahn, Tae-Hyuk |
author_facet | Paul, Alexander J.; Lawrence, Dylan; Song, Myoungkyu; Lim, Seung-Hwan; Pan, Chongle; Ahn, Tae-Hyuk |
author_sort | Paul, Alexander J. |
collection | PubMed |
description | BACKGROUND: De novo genome assembly builds the genome of a specimen from the overlaps of genomic fragments, without relying on a reference sequence. Sequence fragments (called reads) are assembled into contigs and scaffolds based on these overlaps. The quality of the de novo assembly depends on the length and continuity of the assembly. To enable faster and more accurate assembly of species, sequencing techniques such as high-throughput next-generation sequencing and long-read third-generation sequencing have been proposed. However, these techniques require large amounts of computer memory when very large overlap graphs are resolved, and the resolution is challenging to parallelize. RESULTS: To address these limitations, we propose an innovative algorithmic approach called Scalable Overlap-graph Reduction Algorithms (SORA). SORA is an algorithm package that performs string graph reduction using Apache Spark. SORA's implementations are designed to execute de novo genome assembly on either a single machine or a distributed computing platform. SORA efficiently reduces the number of edges on very large graph paths by adapting the scalable features of the graph processing libraries provided by Apache Spark, GraphX and GraphFrames. CONCLUSIONS: The algorithms and experimental results are shared on our project website, https://github.com/BioHPC/SORA. We evaluated SORA with human genome samples. First, it processed a graph of nearly one billion edges on a distributed cloud cluster. Second, it processed small-to-mid-size graphs on a single workstation within a short time frame. Overall, SORA achieved linear scaling in simulations as the number of computing instances increased. |
format | Online Article Text |
id | pubmed-6805285 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2019 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-6805285 2019-10-24 Using Apache Spark on genome assembly for scalable overlap-graph reduction Paul, Alexander J. Lawrence, Dylan Song, Myoungkyu Lim, Seung-Hwan Pan, Chongle Ahn, Tae-Hyuk Hum Genomics Research BACKGROUND: De novo genome assembly builds the genome of a specimen from the overlaps of genomic fragments, without relying on a reference sequence. Sequence fragments (called reads) are assembled into contigs and scaffolds based on these overlaps. The quality of the de novo assembly depends on the length and continuity of the assembly. To enable faster and more accurate assembly of species, sequencing techniques such as high-throughput next-generation sequencing and long-read third-generation sequencing have been proposed. However, these techniques require large amounts of computer memory when very large overlap graphs are resolved, and the resolution is challenging to parallelize. RESULTS: To address these limitations, we propose an innovative algorithmic approach called Scalable Overlap-graph Reduction Algorithms (SORA). SORA is an algorithm package that performs string graph reduction using Apache Spark. SORA's implementations are designed to execute de novo genome assembly on either a single machine or a distributed computing platform. SORA efficiently reduces the number of edges on very large graph paths by adapting the scalable features of the graph processing libraries provided by Apache Spark, GraphX and GraphFrames. CONCLUSIONS: The algorithms and experimental results are shared on our project website, https://github.com/BioHPC/SORA. We evaluated SORA with human genome samples. First, it processed a graph of nearly one billion edges on a distributed cloud cluster. Second, it processed small-to-mid-size graphs on a single workstation within a short time frame. Overall, SORA achieved linear scaling in simulations as the number of computing instances increased.
BioMed Central 2019-10-22 /pmc/articles/PMC6805285/ /pubmed/31639049 http://dx.doi.org/10.1186/s40246-019-0227-1 Text en © Paul et al. 2019 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. |
spellingShingle | Research Paul, Alexander J. Lawrence, Dylan Song, Myoungkyu Lim, Seung-Hwan Pan, Chongle Ahn, Tae-Hyuk Using Apache Spark on genome assembly for scalable overlap-graph reduction |
title | Using Apache Spark on genome assembly for scalable overlap-graph reduction |
title_full | Using Apache Spark on genome assembly for scalable overlap-graph reduction |
title_fullStr | Using Apache Spark on genome assembly for scalable overlap-graph reduction |
title_full_unstemmed | Using Apache Spark on genome assembly for scalable overlap-graph reduction |
title_short | Using Apache Spark on genome assembly for scalable overlap-graph reduction |
title_sort | using apache spark on genome assembly for scalable overlap-graph reduction |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6805285/ https://www.ncbi.nlm.nih.gov/pubmed/31639049 http://dx.doi.org/10.1186/s40246-019-0227-1 |
work_keys_str_mv | AT paulalexanderj usingapachesparkongenomeassemblyforscalableoverlapgraphreduction AT lawrencedylan usingapachesparkongenomeassemblyforscalableoverlapgraphreduction AT songmyoungkyu usingapachesparkongenomeassemblyforscalableoverlapgraphreduction AT limseunghwan usingapachesparkongenomeassemblyforscalableoverlapgraphreduction AT panchongle usingapachesparkongenomeassemblyforscalableoverlapgraphreduction AT ahntaehyuk usingapachesparkongenomeassemblyforscalableoverlapgraphreduction |
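
The description above says SORA performs string graph reduction to cut down the number of edges in an overlap graph. The record does not name the specific reductions, so the sketch below assumes one standard example, transitive edge reduction, and shows it in plain Python rather than with Spark, GraphX, or GraphFrames. The function name `transitive_reduction`, the read labels, and the edge-list representation are illustrative assumptions, not taken from SORA. The sketch also assumes an acyclic graph and ignores overlap lengths, which a real assembler would take into account.

```python
from collections import defaultdict


def transitive_reduction(edges):
    """Drop every edge (a, c) for which edges (a, b) and (b, c) also exist.

    `edges` is an iterable of (source_read, target_read) pairs.
    Assumes the overlap graph is acyclic (hypothetical simplification).
    Returns the surviving edges as a sorted list.
    """
    successors = defaultdict(set)
    for src, dst in edges:
        successors[src].add(dst)

    redundant = set()
    for a, direct in successors.items():
        for b in direct:
            # .get avoids creating new keys while we iterate over the dict
            for c in successors.get(b, ()):
                # a -> c is implied by the path a -> b -> c
                if c in direct and len({a, b, c}) == 3:
                    redundant.add((a, c))

    kept = {(s, d) for s, ds in successors.items() for d in ds}
    return sorted(kept - redundant)


if __name__ == "__main__":
    # Four hypothetical reads; r1 -> r3 is implied by r1 -> r2 -> r3.
    overlaps = [("r1", "r2"), ("r2", "r3"), ("r1", "r3"), ("r3", "r4")]
    for edge in transitive_reduction(overlaps):
        print(edge)
```

In a distributed setting the same neighbor-of-neighbor check would typically be expressed as a join over an edge table, which is the kind of operation graph libraries on Spark are built to scale.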