Sequence Factorization with Multiple References

The success of high-throughput sequencing has led to an increasing number of projects that sequence large populations of a species. Storage and analysis of sequence data are a key challenge in these projects because of the sheer size of the datasets. Compression is one simple technology to deal wi...

Bibliographic Details
Main authors:	Wandelt, Sebastian, Leser, Ulf
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2015
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4589410/ https://www.ncbi.nlm.nih.gov/pubmed/26422374 http://dx.doi.org/10.1371/journal.pone.0139000

author	Wandelt, Sebastian Leser, Ulf
author_facet	Wandelt, Sebastian Leser, Ulf
author_sort	Wandelt, Sebastian
collection	PubMed
description	The success of high-throughput sequencing has led to an increasing number of projects that sequence large populations of a species. Storage and analysis of sequence data are a key challenge in these projects because of the sheer size of the datasets. Compression is one simple technology to deal with this challenge. Referential factorization and compression schemes, which store only the differences between an input sequence and a reference sequence, have gained considerable interest in this field. Highly similar sequences, e.g., human genomes, can be compressed with a compression ratio of 1,000:1 and more, up to two orders of magnitude better than with standard compression techniques. Recently, it was shown that compression against multiple references from the same species can boost the compression ratio up to 4,000:1. However, a detailed analysis of using multiple references is lacking, e.g., for main memory consumption and optimality. In this paper, we describe one key technique for referential compression against multiple references: the factorization of sequences. Based on the notion of an optimal factorization, we propose optimization heuristics and identify parameter settings which greatly influence 1) the size of the factorization, 2) the time for factorization, and 3) the required amount of main memory. We evaluate a total of 30 setups with a varying number of references on data from three different species. Our results show a wide range of factorization sizes (optimal to an overhead of up to 300%), factorization speeds (0.01 MB/s to more than 600 MB/s), and main memory usage (a few dozen MB to dozens of GB). Based on our evaluation, we identify the best configurations for common use cases. Our evaluation shows that multi-reference factorization is much better than single-reference factorization.
format	Online Article Text
id	pubmed-4589410
institution	National Center for Biotechnology Information
language	English
publishDate	2015
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-4589410 2015-10-02 Sequence Factorization with Multiple References Wandelt, Sebastian Leser, Ulf PLoS One Research Article
Public Library of Science 2015-09-30 /pmc/articles/PMC4589410/ /pubmed/26422374 http://dx.doi.org/10.1371/journal.pone.0139000 Text en © 2015 Wandelt, Leser http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
spellingShingle	Research Article Wandelt, Sebastian Leser, Ulf Sequence Factorization with Multiple References
title	Sequence Factorization with Multiple References
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4589410/ https://www.ncbi.nlm.nih.gov/pubmed/26422374 http://dx.doi.org/10.1371/journal.pone.0139000
work_keys_str_mv	AT wandeltsebastian sequencefactorizationwithmultiplereferences AT leserulf sequencefactorizationwithmultiplereferences
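
The abstract describes referential factorization as storing only the differences between an input sequence and one or more reference sequences. The following is a minimal Python sketch of that idea under stated assumptions: it uses a naive greedy longest-match strategy, which is not necessarily the authors' algorithm and does not guarantee an optimal factorization. The names `longest_match`, `factorize`, and `reconstruct`, the factor layout `(reference index, start, length)`, and the toy sequences are all hypothetical and chosen only for illustration.

```python
from typing import List, Tuple, Union

# Assumed factor layout (not taken from the paper): either a match
# (reference index, start position, length) or one raw character
# that occurs in none of the references.
Match = Tuple[int, int, int]
Factor = Union[Match, str]


def longest_match(text: str, pos: int, references: List[str]) -> Match:
    """Longest prefix of text[pos:] found in any reference (length 0 if none)."""
    best: Match = (0, 0, 0)
    for r, ref in enumerate(references):
        start, length = 0, 0
        # Grow the candidate one character at a time while it still occurs.
        while pos + length < len(text):
            found = ref.find(text[pos:pos + length + 1])
            if found < 0:
                break
            start, length = found, length + 1
        if length > best[2]:  # ties are resolved in favor of the earlier reference
            best = (r, start, length)
    return best


def factorize(text: str, references: List[str]) -> List[Factor]:
    """Greedy left-to-right factorization of text against the references."""
    factors: List[Factor] = []
    pos = 0
    while pos < len(text):
        r, start, length = longest_match(text, pos, references)
        if length == 0:
            factors.append(text[pos])  # raw character, no reference has it
            pos += 1
        else:
            factors.append((r, start, length))
            pos += length
    return factors


def reconstruct(factors: List[Factor], references: List[str]) -> str:
    """Rebuild the original text from its factors and the references."""
    parts = []
    for f in factors:
        if isinstance(f, str):
            parts.append(f)
        else:
            r, start, length = f
            parts.append(references[r][start:start + length])
    return "".join(parts)


refs = ["ACGTACGTTT", "GGGCCCAAAT"]   # toy references, not real data
seq = "ACGTTTGGGCCNACGT"              # toy input sequence
multi = factorize(seq, refs)          # factorization against both references
single = factorize(seq, refs[:1])     # factorization against the first only
```

On this toy input, the factorization against both references needs fewer factors than the one against a single reference, which mirrors the qualitative claim in the abstract. A practical implementation would replace the repeated `str.find` calls with an index over the references (for example a suffix array), since the naive search here is far too slow for genome-scale data.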

Similar Items