Cargando…

Hamming-shifting graph of genomic short reads: Efficient construction and its application for compression

Graphs such as de Bruijn graphs and OLC (overlap-layout-consensus) graphs have been widely adopted for the de novo assembly of genomic short reads. This work studies another important problem in the field: how graphs can be used for high-performance compression of the large-scale sequencing data. We...

Descripción completa

Detalles Bibliográficos
Autores principales: Liu, Yuansheng, Li, Jinyan
Formato: Online Artículo Texto
Lenguaje:English
Publicado: Public Library of Science 2021
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8321399/
https://www.ncbi.nlm.nih.gov/pubmed/34280186
http://dx.doi.org/10.1371/journal.pcbi.1009229
_version_ 1783730841900285952
author Liu, Yuansheng
Li, Jinyan
author_facet Liu, Yuansheng
Li, Jinyan
author_sort Liu, Yuansheng
collection PubMed
description Graphs such as de Bruijn graphs and OLC (overlap-layout-consensus) graphs have been widely adopted for the de novo assembly of genomic short reads. This work studies another important problem in the field: how graphs can be used for high-performance compression of the large-scale sequencing data. We present a novel graph definition named Hamming-Shifting graph to address this problem. The definition originates from the technological characteristics of next-generation sequencing machines, aiming to link all pairs of distinct reads that have a small Hamming distance or a small shifting offset or both. We compute multiple lexicographically minimal k-mers to index the reads for an efficient search of the weight-lightest edges, and we prove a very high probability of successfully detecting these edges. The resulted graph creates a full mutual reference of the reads to cascade a code-minimized transfer of every child-read for an optimal compression. We conducted compression experiments on the minimum spanning forest of this extremely sparse graph, and achieved a 10 − 30% more file size reduction compared to the best compression results using existing algorithms. As future work, the separation and connectivity degrees of these giant graphs can be used as economical measurements or protocols for quick quality assessment of wet-lab machines, for sufficiency control of genomic library preparation, and for accurate de novo genome assembly.
format Online
Article
Text
id pubmed-8321399
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Public Library of Science
record_format MEDLINE/PubMed
spelling pubmed-83213992021-07-31 Hamming-shifting graph of genomic short reads: Efficient construction and its application for compression Liu, Yuansheng Li, Jinyan PLoS Comput Biol Research Article Graphs such as de Bruijn graphs and OLC (overlap-layout-consensus) graphs have been widely adopted for the de novo assembly of genomic short reads. This work studies another important problem in the field: how graphs can be used for high-performance compression of the large-scale sequencing data. We present a novel graph definition named Hamming-Shifting graph to address this problem. The definition originates from the technological characteristics of next-generation sequencing machines, aiming to link all pairs of distinct reads that have a small Hamming distance or a small shifting offset or both. We compute multiple lexicographically minimal k-mers to index the reads for an efficient search of the weight-lightest edges, and we prove a very high probability of successfully detecting these edges. The resulted graph creates a full mutual reference of the reads to cascade a code-minimized transfer of every child-read for an optimal compression. We conducted compression experiments on the minimum spanning forest of this extremely sparse graph, and achieved a 10 − 30% more file size reduction compared to the best compression results using existing algorithms. As future work, the separation and connectivity degrees of these giant graphs can be used as economical measurements or protocols for quick quality assessment of wet-lab machines, for sufficiency control of genomic library preparation, and for accurate de novo genome assembly. Public Library of Science 2021-07-19 /pmc/articles/PMC8321399/ /pubmed/34280186 http://dx.doi.org/10.1371/journal.pcbi.1009229 Text en © 2021 Liu, Li https://creativecommons.org/licenses/by/4.0/This is an open access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/) , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle Research Article
Liu, Yuansheng
Li, Jinyan
Hamming-shifting graph of genomic short reads: Efficient construction and its application for compression
title Hamming-shifting graph of genomic short reads: Efficient construction and its application for compression
title_full Hamming-shifting graph of genomic short reads: Efficient construction and its application for compression
title_fullStr Hamming-shifting graph of genomic short reads: Efficient construction and its application for compression
title_full_unstemmed Hamming-shifting graph of genomic short reads: Efficient construction and its application for compression
title_short Hamming-shifting graph of genomic short reads: Efficient construction and its application for compression
title_sort hamming-shifting graph of genomic short reads: efficient construction and its application for compression
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8321399/
https://www.ncbi.nlm.nih.gov/pubmed/34280186
http://dx.doi.org/10.1371/journal.pcbi.1009229
work_keys_str_mv AT liuyuansheng hammingshiftinggraphofgenomicshortreadsefficientconstructionanditsapplicationforcompression
AT lijinyan hammingshiftinggraphofgenomicshortreadsefficientconstructionanditsapplicationforcompression