Cargando…

Sketching methods with small window guarantee using minimum decycling sets

Most sequence sketching methods work by selecting specific [Formula: see text]-mers from sequences so that the similarity between two sequences can be estimated using only the sketches. Because estimating sequence similarity is much faster using sketches than using sequence alignment, sketching meth...

Descripción completa

Detalles Bibliográficos
Autores principales: Marçais, Guillaume, DeBlasio, Dan, Kingsford, Carl
Formato: Online Artículo Texto
Lenguaje:English
Publicado: Cornell University 2023
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10659450/
https://www.ncbi.nlm.nih.gov/pubmed/37986724
_version_ 1785137579374411776
author Marçais, Guillaume
DeBlasio, Dan
Kingsford, Carl
author_facet Marçais, Guillaume
DeBlasio, Dan
Kingsford, Carl
author_sort Marçais, Guillaume
collection PubMed
description Most sequence sketching methods work by selecting specific [Formula: see text]-mers from sequences so that the similarity between two sequences can be estimated using only the sketches. Because estimating sequence similarity is much faster using sketches than using sequence alignment, sketching methods are used to reduce the computational requirements of computational biology software packages. Applications using sketches often rely on properties of the [Formula: see text]-mer selection procedure to ensure that using a sketch does not degrade the quality of the results compared with using sequence alignment. Two important examples of such properties are locality and window guarantees, the latter of which ensures that no long region of the sequence goes unrepresented in the sketch. A sketching method with a window guarantee, implicitly or explicitly, corresponds to a Decycling Set, an unavoidable sets of [Formula: see text]-mers. Any long enough sequence, by definition, must contain a [Formula: see text]-mer from any decycling set (hence, it is unavoidable). Conversely, a decycling set also defines a sketching method by choosing the [Formula: see text]-mers from the set as representatives. Although current methods use one of a small number of sketching method families, the space of decycling sets is much larger, and largely unexplored. Finding decycling sets with desirable characteristics (e.g., small remaining path length) is a promising approach to discovering new sketching methods with improved performance (e.g., with small window guarantee). The Minimum Decycling Sets (MDSs) are of particular interest because of their minimum size. Only two algorithms, by Mykkeltveit and Champarnaud, are previously known to generate two particular MDSs, although there are typically a vast number of alternative MDSs. We provide a simple method to enumerate MDSs. This method allows one to explore the space of MDSs and to find MDSs optimized for desirable properties. We give evidence that the Mykkeltveit sets are close to optimal regarding one particular property, the remaining path length. A number of conjectures and computational and theoretical evidence to support them are presented. Code available at https://github.com/Kingsford-Group/mdsscope.
format Online
Article
Text
id pubmed-10659450
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher Cornell University
record_format MEDLINE/PubMed
spelling pubmed-106594502023-11-06 Sketching methods with small window guarantee using minimum decycling sets Marçais, Guillaume DeBlasio, Dan Kingsford, Carl ArXiv Article Most sequence sketching methods work by selecting specific [Formula: see text]-mers from sequences so that the similarity between two sequences can be estimated using only the sketches. Because estimating sequence similarity is much faster using sketches than using sequence alignment, sketching methods are used to reduce the computational requirements of computational biology software packages. Applications using sketches often rely on properties of the [Formula: see text]-mer selection procedure to ensure that using a sketch does not degrade the quality of the results compared with using sequence alignment. Two important examples of such properties are locality and window guarantees, the latter of which ensures that no long region of the sequence goes unrepresented in the sketch. A sketching method with a window guarantee, implicitly or explicitly, corresponds to a Decycling Set, an unavoidable sets of [Formula: see text]-mers. Any long enough sequence, by definition, must contain a [Formula: see text]-mer from any decycling set (hence, it is unavoidable). Conversely, a decycling set also defines a sketching method by choosing the [Formula: see text]-mers from the set as representatives. Although current methods use one of a small number of sketching method families, the space of decycling sets is much larger, and largely unexplored. Finding decycling sets with desirable characteristics (e.g., small remaining path length) is a promising approach to discovering new sketching methods with improved performance (e.g., with small window guarantee). The Minimum Decycling Sets (MDSs) are of particular interest because of their minimum size. Only two algorithms, by Mykkeltveit and Champarnaud, are previously known to generate two particular MDSs, although there are typically a vast number of alternative MDSs. We provide a simple method to enumerate MDSs. This method allows one to explore the space of MDSs and to find MDSs optimized for desirable properties. We give evidence that the Mykkeltveit sets are close to optimal regarding one particular property, the remaining path length. A number of conjectures and computational and theoretical evidence to support them are presented. Code available at https://github.com/Kingsford-Group/mdsscope. Cornell University 2023-11-06 /pmc/articles/PMC10659450/ /pubmed/37986724 Text en https://creativecommons.org/licenses/by/4.0/This work is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/) , which allows reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use.
spellingShingle Article
Marçais, Guillaume
DeBlasio, Dan
Kingsford, Carl
Sketching methods with small window guarantee using minimum decycling sets
title Sketching methods with small window guarantee using minimum decycling sets
title_full Sketching methods with small window guarantee using minimum decycling sets
title_fullStr Sketching methods with small window guarantee using minimum decycling sets
title_full_unstemmed Sketching methods with small window guarantee using minimum decycling sets
title_short Sketching methods with small window guarantee using minimum decycling sets
title_sort sketching methods with small window guarantee using minimum decycling sets
topic Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10659450/
https://www.ncbi.nlm.nih.gov/pubmed/37986724
work_keys_str_mv AT marcaisguillaume sketchingmethodswithsmallwindowguaranteeusingminimumdecyclingsets
AT deblasiodan sketchingmethodswithsmallwindowguaranteeusingminimumdecyclingsets
AT kingsfordcarl sketchingmethodswithsmallwindowguaranteeusingminimumdecyclingsets