Cargando…

Sparse Binary Relation Representations for Genome Graph Annotation

High-throughput DNA sequencing data are accumulating in public repositories, and efficient approaches for storing and indexing such data are in high demand. In recent research, several graph data structures have been proposed to represent large sets of sequencing data and to allow for efficient quer...

Descripción completa

Detalles Bibliográficos
Autores principales: Karasikov, Mikhail, Mustafa, Harun, Joudaki, Amir, Javadzadeh-no, Sara, Rätsch, Gunnar, Kahles, André
Formato: Online Artículo Texto
Lenguaje:English
Publicado: Mary Ann Liebert, Inc., publishers 2020
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7185347/
https://www.ncbi.nlm.nih.gov/pubmed/31891531
http://dx.doi.org/10.1089/cmb.2019.0324
_version_ 1783526743039016960
author Karasikov, Mikhail
Mustafa, Harun
Joudaki, Amir
Javadzadeh-no, Sara
Rätsch, Gunnar
Kahles, André
author_facet Karasikov, Mikhail
Mustafa, Harun
Joudaki, Amir
Javadzadeh-no, Sara
Rätsch, Gunnar
Kahles, André
author_sort Karasikov, Mikhail
collection PubMed
description High-throughput DNA sequencing data are accumulating in public repositories, and efficient approaches for storing and indexing such data are in high demand. In recent research, several graph data structures have been proposed to represent large sets of sequencing data and to allow for efficient querying of sequences. In particular, the concept of labeled de Bruijn graphs has been explored by several groups. Although there has been good progress toward representing the sequence graph in small space, methods for storing a set of labels on top of such graphs are still not sufficiently explored. It is also currently not clear how characteristics of the input data, such as the sparsity and correlations of labels, can help to inform the choice of method to compress the graph labeling. In this study, we present a new compression approach, Multi-binary relation wavelet tree (BRWT), which is adaptive to different kinds of input data. We show an up to 29% improvement in compression performance over the basic BRWT method, and up to a 68% improvement over the current state-of-the-art for de Bruijn graph label compression. To put our results into perspective, we present a systematic analysis of five different state-of-the-art annotation compression schemes, evaluate key metrics on both artificial and real-world data, and discuss how different data characteristics influence the compression performance. We show that the improvements of our new method can be robustly reproduced for different representative real-world data sets.
format Online
Article
Text
id pubmed-7185347
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Mary Ann Liebert, Inc., publishers
record_format MEDLINE/PubMed
spelling pubmed-71853472020-05-06 Sparse Binary Relation Representations for Genome Graph Annotation Karasikov, Mikhail Mustafa, Harun Joudaki, Amir Javadzadeh-no, Sara Rätsch, Gunnar Kahles, André J Comput Biol Research Articles High-throughput DNA sequencing data are accumulating in public repositories, and efficient approaches for storing and indexing such data are in high demand. In recent research, several graph data structures have been proposed to represent large sets of sequencing data and to allow for efficient querying of sequences. In particular, the concept of labeled de Bruijn graphs has been explored by several groups. Although there has been good progress toward representing the sequence graph in small space, methods for storing a set of labels on top of such graphs are still not sufficiently explored. It is also currently not clear how characteristics of the input data, such as the sparsity and correlations of labels, can help to inform the choice of method to compress the graph labeling. In this study, we present a new compression approach, Multi-binary relation wavelet tree (BRWT), which is adaptive to different kinds of input data. We show an up to 29% improvement in compression performance over the basic BRWT method, and up to a 68% improvement over the current state-of-the-art for de Bruijn graph label compression. To put our results into perspective, we present a systematic analysis of five different state-of-the-art annotation compression schemes, evaluate key metrics on both artificial and real-world data, and discuss how different data characteristics influence the compression performance. We show that the improvements of our new method can be robustly reproduced for different representative real-world data sets. Mary Ann Liebert, Inc., publishers 2020-04-01 2020-04-08 /pmc/articles/PMC7185347/ /pubmed/31891531 http://dx.doi.org/10.1089/cmb.2019.0324 Text en © Mikhail Karasikov, et al., 2019. Published by Mary Ann Liebert, Inc. This Open Access article is distributed under the terms of the Creative Commons Attribution Noncommercial License (http://creativecommons.org/licenses/by-nc/4.0/) which permits any noncommercial use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.
spellingShingle Research Articles
Karasikov, Mikhail
Mustafa, Harun
Joudaki, Amir
Javadzadeh-no, Sara
Rätsch, Gunnar
Kahles, André
Sparse Binary Relation Representations for Genome Graph Annotation
title Sparse Binary Relation Representations for Genome Graph Annotation
title_full Sparse Binary Relation Representations for Genome Graph Annotation
title_fullStr Sparse Binary Relation Representations for Genome Graph Annotation
title_full_unstemmed Sparse Binary Relation Representations for Genome Graph Annotation
title_short Sparse Binary Relation Representations for Genome Graph Annotation
title_sort sparse binary relation representations for genome graph annotation
topic Research Articles
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7185347/
https://www.ncbi.nlm.nih.gov/pubmed/31891531
http://dx.doi.org/10.1089/cmb.2019.0324
work_keys_str_mv AT karasikovmikhail sparsebinaryrelationrepresentationsforgenomegraphannotation
AT mustafaharun sparsebinaryrelationrepresentationsforgenomegraphannotation
AT joudakiamir sparsebinaryrelationrepresentationsforgenomegraphannotation
AT javadzadehnosara sparsebinaryrelationrepresentationsforgenomegraphannotation
AT ratschgunnar sparsebinaryrelationrepresentationsforgenomegraphannotation
AT kahlesandre sparsebinaryrelationrepresentationsforgenomegraphannotation