Cargando…

Dynamic compression schemes for graph coloring

MOTIVATION: Technological advancements in high-throughput DNA sequencing have led to an exponential growth of sequencing data being produced and stored as a byproduct of biomedical research. Despite its public availability, a majority of this data remains hard to query for the research community due...

Descripción completa

Detalles Bibliográficos
Autores principales: Mustafa, Harun, Schilken, Ingo, Karasikov, Mikhail, Eickhoff, Carsten, Rätsch, Gunnar, Kahles, André
Formato: Online Artículo Texto
Lenguaje:English
Publicado: Oxford University Press 2019
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6530811/
https://www.ncbi.nlm.nih.gov/pubmed/30020403
http://dx.doi.org/10.1093/bioinformatics/bty632
_version_ 1783420700736880640
author Mustafa, Harun
Schilken, Ingo
Karasikov, Mikhail
Eickhoff, Carsten
Rätsch, Gunnar
Kahles, André
author_facet Mustafa, Harun
Schilken, Ingo
Karasikov, Mikhail
Eickhoff, Carsten
Rätsch, Gunnar
Kahles, André
author_sort Mustafa, Harun
collection PubMed
description MOTIVATION: Technological advancements in high-throughput DNA sequencing have led to an exponential growth of sequencing data being produced and stored as a byproduct of biomedical research. Despite its public availability, a majority of this data remains hard to query for the research community due to a lack of efficient data representation and indexing solutions. One of the available techniques to represent read data is a condensed form as an assembly graph. Such a representation contains all sequence information but does not store contextual information and metadata. RESULTS: We present two new approaches for a compressed representation of a graph coloring: a lossless compression scheme based on a novel application of wavelet tries as well as a highly accurate lossy compression based on a set of Bloom filters. Both strategies retain a coloring even when adding to the underlying graph topology. We present construction and merge procedures for both methods and evaluate their performance on a wide range of different datasets. By dropping the requirement of a fully lossless compression and using the topological information of the underlying graph, we can reduce memory requirements by up to three orders of magnitude. Representing individual colors as independently stored modules, our approaches can be efficiently parallelized and provide strategies for dynamic use. These properties allow for an easy upscaling to the problem sizes common to the biomedical domain. AVAILABILITY AND IMPLEMENTATION: We provide prototype implementations in C++, summaries of our experiments as well as links to all datasets publicly at https://github.com/ratschlab/graph_annotation. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format Online
Article
Text
id pubmed-6530811
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-65308112019-05-28 Dynamic compression schemes for graph coloring Mustafa, Harun Schilken, Ingo Karasikov, Mikhail Eickhoff, Carsten Rätsch, Gunnar Kahles, André Bioinformatics Original Papers MOTIVATION: Technological advancements in high-throughput DNA sequencing have led to an exponential growth of sequencing data being produced and stored as a byproduct of biomedical research. Despite its public availability, a majority of this data remains hard to query for the research community due to a lack of efficient data representation and indexing solutions. One of the available techniques to represent read data is a condensed form as an assembly graph. Such a representation contains all sequence information but does not store contextual information and metadata. RESULTS: We present two new approaches for a compressed representation of a graph coloring: a lossless compression scheme based on a novel application of wavelet tries as well as a highly accurate lossy compression based on a set of Bloom filters. Both strategies retain a coloring even when adding to the underlying graph topology. We present construction and merge procedures for both methods and evaluate their performance on a wide range of different datasets. By dropping the requirement of a fully lossless compression and using the topological information of the underlying graph, we can reduce memory requirements by up to three orders of magnitude. Representing individual colors as independently stored modules, our approaches can be efficiently parallelized and provide strategies for dynamic use. These properties allow for an easy upscaling to the problem sizes common to the biomedical domain. AVAILABILITY AND IMPLEMENTATION: We provide prototype implementations in C++, summaries of our experiments as well as links to all datasets publicly at https://github.com/ratschlab/graph_annotation. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. Oxford University Press 2019-02-01 2018-07-18 /pmc/articles/PMC6530811/ /pubmed/30020403 http://dx.doi.org/10.1093/bioinformatics/bty632 Text en © The Author(s) 2018. Published by Oxford University Press. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Original Papers
Mustafa, Harun
Schilken, Ingo
Karasikov, Mikhail
Eickhoff, Carsten
Rätsch, Gunnar
Kahles, André
Dynamic compression schemes for graph coloring
title Dynamic compression schemes for graph coloring
title_full Dynamic compression schemes for graph coloring
title_fullStr Dynamic compression schemes for graph coloring
title_full_unstemmed Dynamic compression schemes for graph coloring
title_short Dynamic compression schemes for graph coloring
title_sort dynamic compression schemes for graph coloring
topic Original Papers
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6530811/
https://www.ncbi.nlm.nih.gov/pubmed/30020403
http://dx.doi.org/10.1093/bioinformatics/bty632
work_keys_str_mv AT mustafaharun dynamiccompressionschemesforgraphcoloring
AT schilkeningo dynamiccompressionschemesforgraphcoloring
AT karasikovmikhail dynamiccompressionschemesforgraphcoloring
AT eickhoffcarsten dynamiccompressionschemesforgraphcoloring
AT ratschgunnar dynamiccompressionschemesforgraphcoloring
AT kahlesandre dynamiccompressionschemesforgraphcoloring