Cargando…

Sequence-specific minimizers via polar sets

MOTIVATION: Minimizers are efficient methods to sample k-mers from genomic sequences that unconditionally preserve sufficiently long matches between sequences. Well-established methods to construct efficient minimizers focus on sampling fewer k-mers on a random sequence and use universal hitting set...

Descripción completa

Detalles Bibliográficos
Autores principales: Zheng, Hongyu, Kingsford, Carl, Marçais, Guillaume
Formato: Online Artículo Texto
Lenguaje:English
Publicado: Oxford University Press 2021
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8686682/
https://www.ncbi.nlm.nih.gov/pubmed/34252928
http://dx.doi.org/10.1093/bioinformatics/btab313
_version_ 1784618064000581632
author Zheng, Hongyu
Kingsford, Carl
Marçais, Guillaume
author_facet Zheng, Hongyu
Kingsford, Carl
Marçais, Guillaume
author_sort Zheng, Hongyu
collection PubMed
description MOTIVATION: Minimizers are efficient methods to sample k-mers from genomic sequences that unconditionally preserve sufficiently long matches between sequences. Well-established methods to construct efficient minimizers focus on sampling fewer k-mers on a random sequence and use universal hitting sets (sets of k-mers that appear frequently enough) to upper bound the sketch size. In contrast, the problem of sequence-specific minimizers, which is to construct efficient minimizers to sample fewer k-mers on a specific sequence such as the reference genome, is less studied. Currently, the theoretical understanding of this problem is lacking, and existing methods do not specialize well to sketch specific sequences. RESULTS: We propose the concept of polar sets, complementary to the existing idea of universal hitting sets. Polar sets are k-mer sets that are spread out enough on the reference, and provably specialize well to specific sequences. Link energy measures how well spread out a polar set is, and with it, the sketch size can be bounded from above and below in a theoretically sound way. This allows for direct optimization of sketch size. We propose efficient heuristics to construct polar sets, and via experiments on the human reference genome, show their practical superiority in designing efficient sequence-specific minimizers. AVAILABILITY AND IMPLEMENTATION: A reference implementation and code for analyses under an open-source license are at https://github.com/kingsford-group/polarset. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format Online
Article
Text
id pubmed-8686682
institution National Center for Biotechnology Information
language English
publishDate 2021
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-86866822021-12-21 Sequence-specific minimizers via polar sets Zheng, Hongyu Kingsford, Carl Marçais, Guillaume Bioinformatics Genome Sequence Analysis MOTIVATION: Minimizers are efficient methods to sample k-mers from genomic sequences that unconditionally preserve sufficiently long matches between sequences. Well-established methods to construct efficient minimizers focus on sampling fewer k-mers on a random sequence and use universal hitting sets (sets of k-mers that appear frequently enough) to upper bound the sketch size. In contrast, the problem of sequence-specific minimizers, which is to construct efficient minimizers to sample fewer k-mers on a specific sequence such as the reference genome, is less studied. Currently, the theoretical understanding of this problem is lacking, and existing methods do not specialize well to sketch specific sequences. RESULTS: We propose the concept of polar sets, complementary to the existing idea of universal hitting sets. Polar sets are k-mer sets that are spread out enough on the reference, and provably specialize well to specific sequences. Link energy measures how well spread out a polar set is, and with it, the sketch size can be bounded from above and below in a theoretically sound way. This allows for direct optimization of sketch size. We propose efficient heuristics to construct polar sets, and via experiments on the human reference genome, show their practical superiority in designing efficient sequence-specific minimizers. AVAILABILITY AND IMPLEMENTATION: A reference implementation and code for analyses under an open-source license are at https://github.com/kingsford-group/polarset. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. Oxford University Press 2021-07-12 /pmc/articles/PMC8686682/ /pubmed/34252928 http://dx.doi.org/10.1093/bioinformatics/btab313 Text en © The Author(s) 2021. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/ (https://creativecommons.org/licenses/by/4.0/) ), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Genome Sequence Analysis
Zheng, Hongyu
Kingsford, Carl
Marçais, Guillaume
Sequence-specific minimizers via polar sets
title Sequence-specific minimizers via polar sets
title_full Sequence-specific minimizers via polar sets
title_fullStr Sequence-specific minimizers via polar sets
title_full_unstemmed Sequence-specific minimizers via polar sets
title_short Sequence-specific minimizers via polar sets
title_sort sequence-specific minimizers via polar sets
topic Genome Sequence Analysis
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8686682/
https://www.ncbi.nlm.nih.gov/pubmed/34252928
http://dx.doi.org/10.1093/bioinformatics/btab313
work_keys_str_mv AT zhenghongyu sequencespecificminimizersviapolarsets
AT kingsfordcarl sequencespecificminimizersviapolarsets
AT marcaisguillaume sequencespecificminimizersviapolarsets