HUGO: Hierarchical mUlti-reference Genome cOmpression for aligned reads

BACKGROUND AND OBJECTIVE: Short-read sequencing is becoming the standard of practice for the study of structural variants associated with disease. However, with the growth of sequence data largely surpassing reasonable storage capability, the biomedical community is challenged with the management, t...

Bibliographic Details
Main authors: Li, Pinghao, Jiang, Xiaoqian, Wang, Shuang, Kim, Jihoon, Xiong, Hongkai, Ohno-Machado, Lucila
Format: Online Article Text
Language: English
Published: BMJ Publishing Group 2014
Subjects: Research and Applications
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3932469/
https://www.ncbi.nlm.nih.gov/pubmed/24368726
http://dx.doi.org/10.1136/amiajnl-2013-002147
author Li, Pinghao
Jiang, Xiaoqian
Wang, Shuang
Kim, Jihoon
Xiong, Hongkai
Ohno-Machado, Lucila
collection PubMed
description BACKGROUND AND OBJECTIVE: Short-read sequencing is becoming the standard of practice for the study of structural variants associated with disease. However, with the growth of sequence data largely surpassing reasonable storage capability, the biomedical community is challenged with the management, transfer, archiving, and storage of sequence data. METHODS: We developed Hierarchical mUlti-reference Genome cOmpression (HUGO), a novel compression algorithm for aligned reads in the sorted Sequence Alignment/Map (SAM) format. We first aligned short reads against a reference genome and stored exactly mapped reads for compression. For the inexactly mapped or unmapped reads, we realigned them against different reference genomes using an adaptive scheme that gradually shortens the read length. For the base quality values, we offer lossy and lossless compression mechanisms. The lossy compression mechanism uses k-means clustering, where a user can adjust the balance between decompression quality and compression rate. Lossless compression is obtained by setting k (the number of clusters) to the number of different quality values. RESULTS: The proposed method produced a compression ratio in the range 0.5–0.65, which corresponds to 35–50% storage savings based on experimental datasets. The proposed approach achieved 15% more storage savings than CRAM and a compression ratio comparable to that of Samcomp (CRAM and Samcomp are two of the state-of-the-art genome compression algorithms). The software is freely available at https://sourceforge.net/projects/hierachicaldnac/ under the General Public License (GPL). LIMITATION: Our method requires having different reference genomes and prolongs the execution time for additional alignments. CONCLUSIONS: The proposed multi-reference-based compression algorithm for aligned reads outperforms existing single-reference-based algorithms.
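The quality-value step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `kmeans_1d` and `quantize`, the evenly spaced initialization, and the use of integer Phred-style scores are all assumptions made for illustration. It shows how k-means with a small k maps quality values onto a few representative levels (lossy), and why setting k to the number of distinct quality values reproduces the input exactly (lossless).

```python
from collections import Counter


def kmeans_1d(values, k, iters=50):
    """Lloyd's algorithm on scalar values (hypothetical helper, not HUGO's code)."""
    distinct = sorted(set(values))
    k = min(k, len(distinct))
    if k == 1:
        centroids = [sum(values) / len(values)]
    else:
        # Assumed initialization: spread the starting centroids evenly
        # across the sorted distinct values so the result is deterministic.
        last = len(distinct) - 1
        centroids = [distinct[round(i * last / (k - 1))] for i in range(k)]
    counts = Counter(values)
    for _ in range(iters):
        sums = [0.0] * k
        weights = [0] * k
        # Assignment step: each distinct value goes to its nearest centroid.
        for v, c in counts.items():
            j = min(range(k), key=lambda i: abs(v - centroids[i]))
            sums[j] += v * c
            weights[j] += c
        # Update step: move each centroid to the mean of its members.
        new = [sums[j] / weights[j] if weights[j] else centroids[j]
               for j in range(k)]
        if new == centroids:
            break
        centroids = new
    return centroids


def quantize(quals, centroids):
    """Replace each quality value by its nearest centroid, rounded to an integer."""
    return [round(min(centroids, key=lambda c: abs(q - c))) for q in quals]


quals = [2, 2, 2, 30, 31, 32, 38, 39, 40, 40]
lossy = quantize(quals, kmeans_1d(quals, 3))
lossless = quantize(quals, kmeans_1d(quals, len(set(quals))))
print(lossy)              # [2, 2, 2, 31, 31, 31, 39, 39, 39, 39]
print(lossless == quals)  # True
```

A smaller k leaves fewer distinct symbols for the downstream entropy coder at the cost of larger quantization error, which is the trade-off the abstract says the user can tune.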
format Online
Article
Text
id pubmed-3932469
institution National Center for Biotechnology Information
language English
publishDate 2014
publisher BMJ Publishing Group
record_format MEDLINE/PubMed
spelling pubmed-3932469 2014-02-24 HUGO: Hierarchical mUlti-reference Genome cOmpression for aligned reads Li, Pinghao Jiang, Xiaoqian Wang, Shuang Kim, Jihoon Xiong, Hongkai Ohno-Machado, Lucila J Am Med Inform Assoc Research and Applications BMJ Publishing Group 2014-03 2013-12-24 /pmc/articles/PMC3932469/ /pubmed/24368726 http://dx.doi.org/10.1136/amiajnl-2013-002147 Text en Published by the BMJ Publishing Group Limited. For permission to use (where not already granted under a licence) please go to http://group.bmj.com/group/rights-licensing/permissions This is an Open Access article distributed in accordance with the Creative Commons Attribution Non Commercial (CC BY-NC 3.0) license, which permits others to distribute, remix, adapt, build upon this work non-commercially, and license their derivative works on different terms, provided the original work is properly cited and the use is non-commercial. See: http://creativecommons.org/licenses/by-nc/3.0/
title HUGO: Hierarchical mUlti-reference Genome cOmpression for aligned reads
topic Research and Applications
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3932469/
https://www.ncbi.nlm.nih.gov/pubmed/24368726
http://dx.doi.org/10.1136/amiajnl-2013-002147