
Matchtigs: minimum plain text representation of k-mer sets


Bibliographic Details
Main authors: Schmidt, Sebastian; Khan, Shahbaz; Alanko, Jarno N.; Pibiri, Giulio E.; Tomescu, Alexandru I.
Format: Online Article Text
Language: English
Published: BioMed Central 2023
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10251615/
https://www.ncbi.nlm.nih.gov/pubmed/37296461
http://dx.doi.org/10.1186/s13059-023-02968-z
Description
Summary: We propose a polynomial-time algorithm that computes a minimum plain-text representation of k-mer sets, as well as an efficient near-minimum greedy heuristic. When compressing read sets of large model organisms or bacterial pangenomes, with only a minor runtime increase, we shrink the representation by up to 59% over unitigs and 26% over previous work. Additionally, the number of strings is decreased by up to 97% over unitigs and 90% over previous work. Finally, a small representation has advantages in downstream applications, as it speeds up SSHash-Lite queries by up to 4.26× over unitigs and 2.10× over previous work. SUPPLEMENTARY INFORMATION: The online version contains supplementary material available at 10.1186/s13059-023-02968-z.
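
To make the summary's terms concrete, here is a minimal sketch of what a plain-text representation of a k-mer set is and how its size can be measured, both as total characters and as number of strings. It is not the authors' algorithm and does not compute a minimum. It assumes a toy k-mer set and ignores reverse complements. The names kmers, represents and cumulative_length are hypothetical.

```python
def kmers(s, k):
    """Return the set of all length-k substrings of s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}


def represents(strings, kmer_set, k):
    """True if the strings contain exactly the k-mers in kmer_set.

    A k-mer may occur more than once across the strings; only the
    resulting set of k-mers is compared.
    """
    found = set()
    for s in strings:
        found |= kmers(s, k)
    return found == set(kmer_set)


def cumulative_length(strings):
    """Total number of characters over all strings."""
    return sum(len(s) for s in strings)


# Toy input (an assumption for illustration, not data from the article).
K = 3
KMER_SET = {"ACG", "CGT", "GTA", "CGA"}

# Trivial representation: one string per k-mer.
one_per_kmer = sorted(KMER_SET)

# Overlapping k-mers merged into longer strings.
merged = ["ACGTA", "CGA"]

for name, rep in [("one per k-mer", one_per_kmer), ("merged", merged)]:
    print(name, represents(rep, KMER_SET, K), len(rep), cumulative_length(rep))
```

Both lists represent the same k-mer set, but the merged one uses fewer strings and fewer characters. These are the two quantities the summary reports reductions in.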