Relative Lempel-Ziv Compression of Genomes for Large-Scale Storage and Retrieval

Bibliographic Details
Main Authors: Kuruppu, Shanika, Puglisi, Simon J., Zobel, Justin
Format: Online Article Text
Language: English
Published: 2010
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7121420/
http://dx.doi.org/10.1007/978-3-642-16321-0_20
Description
Summary: Self-indexes – data structures that simultaneously provide fast search of and access to compressed text – are promising for genomic data but in their usual form are not able to exploit the high level of replication present in a collection of related genomes. Our ‘RLZ’ approach is to store a self-index for a base sequence and then compress every other sequence as an LZ77 encoding relative to the base. For a collection of r sequences totaling N bases, with a total of s point mutations from a base sequence of length n, this representation requires just [Formula: see text] bits. At the cost of negligible extra space, access to ℓ consecutive symbols requires [Formula: see text] time. Our experiments show that, for example, RLZ can represent individual human genomes in around 0.1 bits per base while supporting rapid access and using relatively little memory.
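
The idea of encoding a sequence relative to a base can be illustrated with a small Python sketch. This is not the authors' implementation: it assumes a greedy left-to-right parse and replaces the self-index over the base with a naive `str.find` search, which is far slower but keeps the example short. The function names (`longest_match`, `rlz_factorize`, `rlz_decode`) and the factor layout (a `(position, length)` pair for a copy from the base, a one-character string for a symbol absent from the base) are hypothetical choices made for illustration.

```python
def longest_match(base, seq, i):
    """Return (pos, length) of the longest prefix of seq[i:] found in base.

    Assumption: a naive search via str.find stands in for the self-index
    that the abstract describes.
    """
    pos, length = -1, 0
    while i + length < len(seq):
        p = base.find(seq[i:i + length + 1])
        if p == -1:
            break
        pos, length = p, length + 1
    return pos, length


def rlz_factorize(base, seq):
    """Greedily parse seq into factors that refer to substrings of base."""
    factors = []
    i = 0
    while i < len(seq):
        pos, length = longest_match(base, seq, i)
        if length == 0:
            # Symbol does not occur in base at all: emit it as a literal.
            factors.append(seq[i])
            i += 1
        else:
            factors.append((pos, length))
            i += length
    return factors


def rlz_decode(base, factors):
    """Rebuild the original sequence from base and its factor list."""
    return "".join(
        f if isinstance(f, str) else base[f[0]:f[0] + f[1]]
        for f in factors
    )


if __name__ == "__main__":
    base = "ACGT"
    seq = "ACGTNACG"
    factors = rlz_factorize(base, seq)
    print(factors)
    print(rlz_decode(base, factors) == seq)
```

In this sketch a sequence that differs from the base by only a few point mutations parses into a handful of long copy factors, which is the intuition behind the small space bound stated in the summary.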