
ReCoil - an algorithm for compression of extremely large datasets of dna data

The growing volume of generated DNA sequencing data makes the problem of its long-term storage increasingly important. In this work we present ReCoil - an I/O-efficient external memory algorithm designed for compression of very large collections of short-read DNA data. Typically each position of DN...


Bibliographic Details
Main author: Yanovsky, Vladimir
Format: Online Article Text
Language: English
Published: BioMed Central 2011
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3219593/
https://www.ncbi.nlm.nih.gov/pubmed/21988957
http://dx.doi.org/10.1186/1748-7188-6-23
_version_ 1782216856186126336
author Yanovsky, Vladimir
author_facet Yanovsky, Vladimir
author_sort Yanovsky, Vladimir
collection PubMed
description The growing volume of generated DNA sequencing data makes the problem of its long-term storage increasingly important. In this work we present ReCoil - an I/O-efficient external memory algorithm designed for compression of very large collections of short-read DNA data. Typically, each position of a DNA sequence is covered by multiple reads of a short-read dataset, and our algorithm uses the resulting redundancy to achieve a high compression rate. While compression based on encoding mismatches between the dataset and a similar reference can yield a high compression rate, a good-quality reference sequence may be unavailable. Instead, ReCoil's compression is based on encoding the differences between similar or overlapping reads. Because such reads may appear at large distances from each other in the dataset, and because random access memory is a limited resource, ReCoil is designed to work efficiently in external memory, leveraging the high bandwidth of modern hard disk drives.
format Online
Article
Text
id pubmed-3219593
institution National Center for Biotechnology Information
language English
publishDate 2011
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-32195932011-11-18 ReCoil - an algorithm for compression of extremely large datasets of dna data Yanovsky, Vladimir Algorithms Mol Biol Research The growing volume of generated DNA sequencing data makes the problem of its long term storage increasingly important. In this work we present ReCoil - an I/O efficient external memory algorithm designed for compression of very large collections of short reads DNA data. Typically each position of DNA sequence is covered by multiple reads of a short read dataset and our algorithm makes use of resulting redundancy to achieve high compression rate. While compression based on encoding mismatches between the dataset and a similar reference can yield high compression rate, good quality reference sequence may be unavailable. Instead, ReCoil's compression is based on encoding the differences between similar or overlapping reads. As such reads may appear at large distances from each other in the dataset and since random access memory is a limited resource, ReCoil is designed to work efficiently in external memory, leveraging high bandwidth of modern hard disk drives. BioMed Central 2011-10-11 /pmc/articles/PMC3219593/ /pubmed/21988957 http://dx.doi.org/10.1186/1748-7188-6-23 Text en Copyright ©2011 Yanovsky; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Research
Yanovsky, Vladimir
ReCoil - an algorithm for compression of extremely large datasets of dna data
title ReCoil - an algorithm for compression of extremely large datasets of dna data
title_full ReCoil - an algorithm for compression of extremely large datasets of dna data
title_fullStr ReCoil - an algorithm for compression of extremely large datasets of dna data
title_full_unstemmed ReCoil - an algorithm for compression of extremely large datasets of dna data
title_short ReCoil - an algorithm for compression of extremely large datasets of dna data
title_sort recoil - an algorithm for compression of extremely large datasets of dna data
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3219593/
https://www.ncbi.nlm.nih.gov/pubmed/21988957
http://dx.doi.org/10.1186/1748-7188-6-23
work_keys_str_mv AT yanovskyvladimir recoilanalgorithmforcompressionofextremelylargedatasetsofdnadata
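
The description above says that compression is based on encoding the differences between similar reads rather than mismatches against a reference genome. The following is a minimal sketch of that general idea only, not ReCoil's actual encoding: it assumes two reads of equal length that differ by substitutions, and it ignores shifted overlaps, the search for similar reads, and the external-memory design. The names `encode_against` and `decode_against` are hypothetical and do not come from the article.

```python
from typing import List, Tuple

# A diff is a list of (position, base) pairs: the bases of the encoded read
# at every position where it disagrees with the read it is encoded against.
Diff = List[Tuple[int, str]]


def encode_against(base_read: str, read: str) -> Diff:
    """Store `read` as its substitutions relative to a similar `base_read`.

    Assumption of this sketch: both reads have the same length and are
    already aligned position by position.
    """
    if len(base_read) != len(read):
        raise ValueError("this sketch assumes equal-length reads")
    return [(i, b) for i, (a, b) in enumerate(zip(base_read, read)) if a != b]


def decode_against(base_read: str, diff: Diff) -> str:
    """Rebuild the original read from `base_read` and the stored diff."""
    bases = list(base_read)
    for pos, base in diff:
        bases[pos] = base
    return "".join(bases)


if __name__ == "__main__":
    stored = "ACGTACGTAC"   # a read kept verbatim
    similar = "ACGTTCGTAA"  # a similar read, stored only as a diff
    diff = encode_against(stored, similar)
    print(diff)
    assert decode_against(stored, diff) == similar
```

When two reads cover the same region, the diff is short, so storing it in place of the full read saves space. The gain depends on finding similar reads, which the description notes may lie far apart in the dataset.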