Relative Lempel-Ziv Compression of Genomes for Large-Scale Storage and Retrieval
Self-indexes – data structures that simultaneously provide fast search of and access to compressed text – are promising for genomic data but in their usual form are not able to exploit the high level of replication present in a collection of related genomes. Our ‘RLZ’ approach is to store a self-ind...
Main authors: | Kuruppu, Shanika; Puglisi, Simon J.; Zobel, Justin |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | 2010 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7121420/ http://dx.doi.org/10.1007/978-3-642-16321-0_20 |
_version_ | 1783515198416486400 |
---|---|
author | Kuruppu, Shanika Puglisi, Simon J. Zobel, Justin |
author_sort | Kuruppu, Shanika |
collection | PubMed |
description | Self-indexes – data structures that simultaneously provide fast search of and access to compressed text – are promising for genomic data but in their usual form are not able to exploit the high level of replication present in a collection of related genomes. Our ‘RLZ’ approach is to store a self-index for a base sequence and then compress every other sequence as an LZ77 encoding relative to the base. For a collection of r sequences totaling N bases, with a total of s point mutations from a base sequence of length n, this representation requires just [Formula: see text] bits. At the cost of negligible extra space, access to ℓ consecutive symbols requires [Formula: see text] time. Our experiments show that, for example, RLZ can represent individual human genomes in around 0.1 bits per base while supporting rapid access and using relatively little memory. |
format | Online Article Text |
id | pubmed-7121420 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2010 |
record_format | MEDLINE/PubMed |
spelling | pubmed-71214202020-04-06 Relative Lempel-Ziv Compression of Genomes for Large-Scale Storage and Retrieval Kuruppu, Shanika Puglisi, Simon J. Zobel, Justin String Processing and Information Retrieval Article Self-indexes – data structures that simultaneously provide fast search of and access to compressed text – are promising for genomic data but in their usual form are not able to exploit the high level of replication present in a collection of related genomes. Our ‘RLZ’ approach is to store a self-index for a base sequence and then compress every other sequence as an LZ77 encoding relative to the base. For a collection of r sequences totaling N bases, with a total of s point mutations from a base sequence of length n, this representation requires just [Formula: see text] bits. At the cost of negligible extra space, access to ℓ consecutive symbols requires [Formula: see text] time. Our experiments show that, for example, RLZ can represent individual human genomes in around 0.1 bits per base while supporting rapid access and using relatively little memory. 2010 /pmc/articles/PMC7121420/ http://dx.doi.org/10.1007/978-3-642-16321-0_20 Text en © Springer-Verlag Berlin Heidelberg 2010 This article is made available via the PMC Open Access Subset for unrestricted research re-use and secondary analysis in any form or by any means with acknowledgement of the original source. These permissions are granted for the duration of the World Health Organization (WHO) declaration of COVID-19 as a global pandemic. |
title | Relative Lempel-Ziv Compression of Genomes for Large-Scale Storage and Retrieval |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7121420/ http://dx.doi.org/10.1007/978-3-642-16321-0_20 |
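
The description in this record says that every non-base sequence is compressed as an LZ77 encoding relative to the base sequence. The following is a minimal sketch of that idea, not the authors' implementation: it assumes a greedy longest-match parse and uses a naive `str.find` scan in place of the self-index the record mentions. The names `rlz_factorize` and `rlz_decode` and the factor layout (a `(position, length)` pair, or a literal character for symbols absent from the base) are hypothetical choices made for illustration.

```python
from typing import List, Tuple, Union

# A factor is either (position, length) into the base sequence, or a literal
# character when the next symbol does not occur in the base at all.
# This representation is an assumption made for this sketch.
Factor = Union[Tuple[int, int], str]


def rlz_factorize(base: str, seq: str) -> List[Factor]:
    """Greedily parse `seq` into longest substrings of `base` (naive search)."""
    factors: List[Factor] = []
    i = 0
    while i < len(seq):
        pos, length = -1, 0
        # Extend the match one symbol at a time while it still occurs in base.
        while i + length < len(seq):
            hit = base.find(seq[i:i + length + 1])
            if hit < 0:
                break
            pos, length = hit, length + 1
        if length == 0:
            factors.append(seq[i])  # literal: symbol absent from base
            i += 1
        else:
            factors.append((pos, length))
            i += length
    return factors


def rlz_decode(base: str, factors: List[Factor]) -> str:
    """Rebuild the sequence by copying each factor out of the base."""
    out = []
    for f in factors:
        out.append(f if isinstance(f, str) else base[f[0]:f[0] + f[1]])
    return "".join(out)


if __name__ == "__main__":
    base = "ACGTACGTTAGC"
    seq = "ACGTACGATAGC"  # differs from base by one point mutation
    factors = rlz_factorize(base, seq)
    print(factors)
    print(rlz_decode(base, factors) == seq)
```

In this toy input a single point mutation splits the sequence into three factors, which gives some intuition for why the space bound quoted in the description depends on the number of mutations s rather than on the total number of bases N.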