
MBGC: Multiple Bacteria Genome Compressor

BACKGROUND: Genomes within the same species reveal large similarity, exploited by specialized multiple genome compressors. The existing algorithms and tools are however targeted at large, e.g., mammalian, genomes, and their performance on bacteria strains is rather moderate. RESULTS: In this work, w...


Bibliographic Details
Main Authors: Grabowski, Szymon; Kowalski, Tomasz M
Format: Online Article Text
Language: English
Published: Oxford University Press 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8848312/
https://www.ncbi.nlm.nih.gov/pubmed/35084032
http://dx.doi.org/10.1093/gigascience/giab099
author Grabowski, Szymon
Kowalski, Tomasz M
collection PubMed
description BACKGROUND: Genomes within the same species reveal large similarity, exploited by specialized multiple genome compressors. The existing algorithms and tools are however targeted at large, e.g., mammalian, genomes, and their performance on bacteria strains is rather moderate. RESULTS: In this work, we propose MBGC, a specialized genome compressor making use of specific redundancy of bacterial genomes. Its characteristic features are finding both direct and reverse-complemented LZ-matches, as well as a careful management of a reference buffer in a multi-threaded implementation. Our tool is not only compression efficient but also fast. On a collection of 168,311 bacterial genomes, totalling 587 GB, we achieve a compression ratio of approximately a factor of 1,265 and compression (respectively decompression) speed of ∼1,580 MB/s (respectively 780 MB/s) using 8 hardware threads, on a computer with a 14-core/28-thread CPU and a fast SSD, being almost 3 times more succinct and >6 times faster in the compression than the next best competitor.
format Online
Article
Text
id pubmed-8848312
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-8848312 2022-02-17 MBGC: Multiple Bacteria Genome Compressor Grabowski, Szymon Kowalski, Tomasz M Gigascience Research BACKGROUND: Genomes within the same species reveal large similarity, exploited by specialized multiple genome compressors. The existing algorithms and tools are however targeted at large, e.g., mammalian, genomes, and their performance on bacteria strains is rather moderate. RESULTS: In this work, we propose MBGC, a specialized genome compressor making use of specific redundancy of bacterial genomes. Its characteristic features are finding both direct and reverse-complemented LZ-matches, as well as a careful management of a reference buffer in a multi-threaded implementation. Our tool is not only compression efficient but also fast. On a collection of 168,311 bacterial genomes, totalling 587 GB, we achieve a compression ratio of approximately a factor of 1,265 and compression (respectively decompression) speed of ∼1,580 MB/s (respectively 780 MB/s) using 8 hardware threads, on a computer with a 14-core/28-thread CPU and a fast SSD, being almost 3 times more succinct and >6 times faster in the compression than the next best competitor. Oxford University Press 2022-01-27 /pmc/articles/PMC8848312/ /pubmed/35084032 http://dx.doi.org/10.1093/gigascience/giab099 Text en © The Author(s) 2022. Published by Oxford University Press GigaScience. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title MBGC: Multiple Bacteria Genome Compressor
topic Research
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8848312/
https://www.ncbi.nlm.nih.gov/pubmed/35084032
http://dx.doi.org/10.1093/gigascience/giab099
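
The abstract in this record mentions two ideas that a short example may help make concrete: matching a sequence against a reference both directly and as its reverse complement, and what a compression ratio of about 1,265 means for a 587 GB collection. The Python sketch below is illustrative only and is not MBGC's implementation. The function names (`reverse_complement`, `longest_match`, `compressed_size_gb`), the brute-force search, and the toy sequences are all assumptions made for this example.

```python
# Illustrative sketch only; NOT the MBGC algorithm or its API.
# All names and the brute-force search strategy are assumptions.

# Translation table mapping each DNA base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def longest_match(reference: str, query: str, min_len: int = 4):
    """Find the longest prefix of `query` that occurs in `reference`,
    either directly or reverse-complemented.

    Returns (strand, reference_position, length), or None if no prefix
    of at least `min_len` bases matches. Direct matches win ties.
    """
    for length in range(len(query), min_len - 1, -1):
        prefix = query[:length]
        pos = reference.find(prefix)
        if pos != -1:
            return ("direct", pos, length)
        pos = reference.find(reverse_complement(prefix))
        if pos != -1:
            return ("revcomp", pos, length)
    return None


def compressed_size_gb(original_gb: float, ratio: float) -> float:
    """Size after compression, given original size and compression ratio."""
    return original_gb / ratio


if __name__ == "__main__":
    reference = "ACGTTGCAAGGCT"  # toy reference, made up for this example
    print(longest_match(reference, "TTGCAAG"))  # a direct match
    print(longest_match(reference, "AGCCT"))    # matches only as reverse complement
    # Figures from the abstract: 587 GB at a ratio of about 1,265.
    print(round(compressed_size_gb(587, 1265), 3))
```

With the figures quoted in the abstract, the last line works out to roughly 0.464 GB, that is, under half a gigabyte for the whole collection. A practical compressor would use an index rather than repeated substring searches; the linear scan here only keeps the example short.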