
Relative Lempel-Ziv Compression of Genomes for Large-Scale Storage and Retrieval

Self-indexes – data structures that simultaneously provide fast search of and access to compressed text – are promising for genomic data but in their usual form are not able to exploit the high level of replication present in a collection of related genomes. Our ‘RLZ’ approach is to store a self-ind...


Bibliographic Details
Main Authors: Kuruppu, Shanika; Puglisi, Simon J.; Zobel, Justin
Format: Online Article Text
Language: English
Published: 2010
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7121420/
http://dx.doi.org/10.1007/978-3-642-16321-0_20
_version_ 1783515198416486400
author Kuruppu, Shanika
Puglisi, Simon J.
Zobel, Justin
author_facet Kuruppu, Shanika
Puglisi, Simon J.
Zobel, Justin
author_sort Kuruppu, Shanika
collection PubMed
description Self-indexes – data structures that simultaneously provide fast search of and access to compressed text – are promising for genomic data but in their usual form are not able to exploit the high level of replication present in a collection of related genomes. Our ‘RLZ’ approach is to store a self-index for a base sequence and then compress every other sequence as an LZ77 encoding relative to the base. For a collection of r sequences totaling N bases, with a total of s point mutations from a base sequence of length n, this representation requires just [Formula: see text] bits. At the cost of negligible extra space, access to ℓ consecutive symbols requires [Formula: see text] time. Our experiments show that, for example, RLZ can represent individual human genomes in around 0.1 bits per base while supporting rapid access and using relatively little memory.
format Online
Article
Text
id pubmed-7121420
institution National Center for Biotechnology Information
language English
publishDate 2010
record_format MEDLINE/PubMed
spelling pubmed-71214202020-04-06 Relative Lempel-Ziv Compression of Genomes for Large-Scale Storage and Retrieval Kuruppu, Shanika Puglisi, Simon J. Zobel, Justin String Processing and Information Retrieval Article Self-indexes – data structures that simultaneously provide fast search of and access to compressed text – are promising for genomic data but in their usual form are not able to exploit the high level of replication present in a collection of related genomes. Our ‘RLZ’ approach is to store a self-index for a base sequence and then compress every other sequence as an LZ77 encoding relative to the base. For a collection of r sequences totaling N bases, with a total of s point mutations from a base sequence of length n, this representation requires just [Formula: see text] bits. At the cost of negligible extra space, access to ℓ consecutive symbols requires [Formula: see text] time. Our experiments show that, for example, RLZ can represent individual human genomes in around 0.1 bits per base while supporting rapid access and using relatively little memory. 2010 /pmc/articles/PMC7121420/ http://dx.doi.org/10.1007/978-3-642-16321-0_20 Text en © Springer-Verlag Berlin Heidelberg 2010 This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic.
spellingShingle Article
Kuruppu, Shanika
Puglisi, Simon J.
Zobel, Justin
Relative Lempel-Ziv Compression of Genomes for Large-Scale Storage and Retrieval
title Relative Lempel-Ziv Compression of Genomes for Large-Scale Storage and Retrieval
title_full Relative Lempel-Ziv Compression of Genomes for Large-Scale Storage and Retrieval
title_fullStr Relative Lempel-Ziv Compression of Genomes for Large-Scale Storage and Retrieval
title_full_unstemmed Relative Lempel-Ziv Compression of Genomes for Large-Scale Storage and Retrieval
title_short Relative Lempel-Ziv Compression of Genomes for Large-Scale Storage and Retrieval
title_sort relative lempel-ziv compression of genomes for large-scale storage and retrieval
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7121420/
http://dx.doi.org/10.1007/978-3-642-16321-0_20
work_keys_str_mv AT kuruppushanika relativelempelzivcompressionofgenomesforlargescalestorageandretrieval
AT puglisisimonj relativelempelzivcompressionofgenomesforlargescalestorageandretrieval
AT zobeljustin relativelempelzivcompressionofgenomesforlargescalestorageandretrieval
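
The description field above summarizes RLZ as encoding every other sequence as LZ77-style factors relative to a base sequence. A minimal sketch of that idea follows, under stated assumptions: the function names `rlz_factorize` and `rlz_decode` are hypothetical, the factor layout (a `(position, length)` pair, with `(symbol, 0)` for a symbol absent from the base) is chosen for illustration, and a naive `str.find` search stands in for the self-index the record mentions. It is not the authors' implementation and makes no attempt at the space or time bounds quoted in the abstract.

```python
from typing import List, Tuple, Union

# A factor is either (position_in_base, length) for a copied substring,
# or (symbol, 0) for a literal symbol that does not occur in the base.
# This layout is an assumption made for illustration.
Factor = Tuple[Union[int, str], int]


def rlz_factorize(base: str, seq: str) -> List[Factor]:
    """Greedily split seq into the longest substrings found in base."""
    factors: List[Factor] = []
    i = 0
    while i < len(seq):
        pos, length = -1, 0
        # Extend the candidate match one symbol at a time for as long as
        # it still occurs somewhere in the base (naive search).
        while i + length < len(seq):
            p = base.find(seq[i:i + length + 1])
            if p == -1:
                break
            pos, length = p, length + 1
        if length == 0:
            factors.append((seq[i], 0))  # literal: symbol not in base
            i += 1
        else:
            factors.append((pos, length))
            i += length
    return factors


def rlz_decode(base: str, factors: List[Factor]) -> str:
    """Rebuild the sequence by copying each factor out of the base."""
    parts = []
    for first, length in factors:
        if length == 0:
            parts.append(first)  # literal symbol
        else:
            parts.append(base[first:first + length])
    return "".join(parts)


if __name__ == "__main__":
    base = "ACGTACGT"
    seq = "ACGTTCGT"  # differs from the base by one point mutation
    factors = rlz_factorize(base, seq)
    print(factors)
    print(rlz_decode(base, factors) == seq)
```

In this sketch a single point mutation splits the sequence into a handful of factors, which is the intuition behind the abstract's dependence on the number of mutations s. Decoding only copies substrings out of the base, so the base must be available whenever any other sequence is read.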