Cargando…
Hamming-shifting graph of genomic short reads: Efficient construction and its application for compression
Graphs such as de Bruijn graphs and OLC (overlap-layout-consensus) graphs have been widely adopted for the de novo assembly of genomic short reads. This work studies another important problem in the field: how graphs can be used for high-performance compression of the large-scale sequencing data. We...
Autores principales: | , |
---|---|
Formato: | Online Artículo Texto |
Lenguaje: | English |
Publicado: |
Public Library of Science
2021
|
Materias: | |
Acceso en línea: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8321399/ https://www.ncbi.nlm.nih.gov/pubmed/34280186 http://dx.doi.org/10.1371/journal.pcbi.1009229 |
_version_ | 1783730841900285952 |
---|---|
author | Liu, Yuansheng Li, Jinyan |
author_facet | Liu, Yuansheng Li, Jinyan |
author_sort | Liu, Yuansheng |
collection | PubMed |
description | Graphs such as de Bruijn graphs and OLC (overlap-layout-consensus) graphs have been widely adopted for the de novo assembly of genomic short reads. This work studies another important problem in the field: how graphs can be used for high-performance compression of the large-scale sequencing data. We present a novel graph definition named Hamming-Shifting graph to address this problem. The definition originates from the technological characteristics of next-generation sequencing machines, aiming to link all pairs of distinct reads that have a small Hamming distance or a small shifting offset or both. We compute multiple lexicographically minimal k-mers to index the reads for an efficient search of the weight-lightest edges, and we prove a very high probability of successfully detecting these edges. The resulted graph creates a full mutual reference of the reads to cascade a code-minimized transfer of every child-read for an optimal compression. We conducted compression experiments on the minimum spanning forest of this extremely sparse graph, and achieved a 10 − 30% more file size reduction compared to the best compression results using existing algorithms. As future work, the separation and connectivity degrees of these giant graphs can be used as economical measurements or protocols for quick quality assessment of wet-lab machines, for sufficiency control of genomic library preparation, and for accurate de novo genome assembly. |
format | Online Article Text |
id | pubmed-8321399 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | Public Library of Science |
record_format | MEDLINE/PubMed |
spelling | pubmed-83213992021-07-31 Hamming-shifting graph of genomic short reads: Efficient construction and its application for compression Liu, Yuansheng Li, Jinyan PLoS Comput Biol Research Article Graphs such as de Bruijn graphs and OLC (overlap-layout-consensus) graphs have been widely adopted for the de novo assembly of genomic short reads. This work studies another important problem in the field: how graphs can be used for high-performance compression of the large-scale sequencing data. We present a novel graph definition named Hamming-Shifting graph to address this problem. The definition originates from the technological characteristics of next-generation sequencing machines, aiming to link all pairs of distinct reads that have a small Hamming distance or a small shifting offset or both. We compute multiple lexicographically minimal k-mers to index the reads for an efficient search of the weight-lightest edges, and we prove a very high probability of successfully detecting these edges. The resulted graph creates a full mutual reference of the reads to cascade a code-minimized transfer of every child-read for an optimal compression. We conducted compression experiments on the minimum spanning forest of this extremely sparse graph, and achieved a 10 − 30% more file size reduction compared to the best compression results using existing algorithms. As future work, the separation and connectivity degrees of these giant graphs can be used as economical measurements or protocols for quick quality assessment of wet-lab machines, for sufficiency control of genomic library preparation, and for accurate de novo genome assembly. Public Library of Science 2021-07-19 /pmc/articles/PMC8321399/ /pubmed/34280186 http://dx.doi.org/10.1371/journal.pcbi.1009229 Text en © 2021 Liu, Li https://creativecommons.org/licenses/by/4.0/This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/) , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. |
spellingShingle | Research Article Liu, Yuansheng Li, Jinyan Hamming-shifting graph of genomic short reads: Efficient construction and its application for compression |
title | Hamming-shifting graph of genomic short reads: Efficient construction and its application for compression |
title_full | Hamming-shifting graph of genomic short reads: Efficient construction and its application for compression |
title_fullStr | Hamming-shifting graph of genomic short reads: Efficient construction and its application for compression |
title_full_unstemmed | Hamming-shifting graph of genomic short reads: Efficient construction and its application for compression |
title_short | Hamming-shifting graph of genomic short reads: Efficient construction and its application for compression |
title_sort | hamming-shifting graph of genomic short reads: efficient construction and its application for compression |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8321399/ https://www.ncbi.nlm.nih.gov/pubmed/34280186 http://dx.doi.org/10.1371/journal.pcbi.1009229 |
work_keys_str_mv | AT liuyuansheng hammingshiftinggraphofgenomicshortreadsefficientconstructionanditsapplicationforcompression AT lijinyan hammingshiftinggraphofgenomicshortreadsefficientconstructionanditsapplicationforcompression |