Cargando…
Theory of local k-mer selection with applications to long-read alignment
MOTIVATION: Selecting a subset of k-mers in a string in a local manner is a common task in bioinformatics tools for speeding up computation. Arguably the most well-known and common method is the minimizer technique, which selects the ‘lowest-ordered’ k-mer in a sliding window. Recently, it has been...
Autores principales: | , |
---|---|
Formato: | Online Artículo Texto |
Lenguaje: | English |
Publicado: |
Oxford University Press
2021
|
Materias: | |
Acceso en línea: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9563685/ https://www.ncbi.nlm.nih.gov/pubmed/36124869 http://dx.doi.org/10.1093/bioinformatics/btab790 |
_version_ | 1784808462982578176 |
---|---|
author | Shaw, Jim Yu, Yun William |
author_facet | Shaw, Jim Yu, Yun William |
author_sort | Shaw, Jim |
collection | PubMed |
description | MOTIVATION: Selecting a subset of k-mers in a string in a local manner is a common task in bioinformatics tools for speeding up computation. Arguably the most well-known and common method is the minimizer technique, which selects the ‘lowest-ordered’ k-mer in a sliding window. Recently, it has been shown that minimizers may be a sub-optimal method for selecting subsets of k-mers when mutations are present. There is, however, a lack of understanding behind the theory of why certain methods perform well. RESULTS: We first theoretically investigate the conservation metric for k-mer selection methods. We derive an exact expression for calculating the conservation of a k-mer selection method. This turns out to be tractable enough for us to prove closed-form expressions for a variety of methods, including (open and closed) syncmers, (a, b, n)-words, and an upper bound for minimizers. As a demonstration of our results, we modified the minimap2 read aligner to use a more conserved k-mer selection method and demonstrate that there is up to an 8.2% relative increase in number of mapped reads. However, we found that the k-mers selected by more conserved methods are also more repetitive, leading to a runtime increase during alignment. We give new insight into how one might use new k-mer selection methods as a reparameterization to optimize for speed and alignment quality. AVAILABILITY AND IMPLEMENTATION: Simulations and supplementary methods are available at https://github.com/bluenote-1577/local-kmer-selection-results. os-minimap2 is a modified version of minimap2 and available at https://github.com/bluenote-1577/os-minimap2. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. |
format | Online Article Text |
id | pubmed-9563685 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2021 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-95636852022-10-18 Theory of local k-mer selection with applications to long-read alignment Shaw, Jim Yu, Yun William Bioinformatics Original Papers MOTIVATION: Selecting a subset of k-mers in a string in a local manner is a common task in bioinformatics tools for speeding up computation. Arguably the most well-known and common method is the minimizer technique, which selects the ‘lowest-ordered’ k-mer in a sliding window. Recently, it has been shown that minimizers may be a sub-optimal method for selecting subsets of k-mers when mutations are present. There is, however, a lack of understanding behind the theory of why certain methods perform well. RESULTS: We first theoretically investigate the conservation metric for k-mer selection methods. We derive an exact expression for calculating the conservation of a k-mer selection method. This turns out to be tractable enough for us to prove closed-form expressions for a variety of methods, including (open and closed) syncmers, (a, b, n)-words, and an upper bound for minimizers. As a demonstration of our results, we modified the minimap2 read aligner to use a more conserved k-mer selection method and demonstrate that there is up to an 8.2% relative increase in number of mapped reads. However, we found that the k-mers selected by more conserved methods are also more repetitive, leading to a runtime increase during alignment. We give new insight into how one might use new k-mer selection methods as a reparameterization to optimize for speed and alignment quality. AVAILABILITY AND IMPLEMENTATION: Simulations and supplementary methods are available at https://github.com/bluenote-1577/local-kmer-selection-results. os-minimap2 is a modified version of minimap2 and available at https://github.com/bluenote-1577/os-minimap2. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. Oxford University Press 2021-11-19 /pmc/articles/PMC9563685/ /pubmed/36124869 http://dx.doi.org/10.1093/bioinformatics/btab790 Text en © The Author(s) 2021. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Original Papers Shaw, Jim Yu, Yun William Theory of local k-mer selection with applications to long-read alignment |
title | Theory of local k-mer selection with applications to long-read alignment |
title_full | Theory of local k-mer selection with applications to long-read alignment |
title_fullStr | Theory of local k-mer selection with applications to long-read alignment |
title_full_unstemmed | Theory of local k-mer selection with applications to long-read alignment |
title_short | Theory of local k-mer selection with applications to long-read alignment |
title_sort | theory of local k-mer selection with applications to long-read alignment |
topic | Original Papers |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9563685/ https://www.ncbi.nlm.nih.gov/pubmed/36124869 http://dx.doi.org/10.1093/bioinformatics/btab790 |
work_keys_str_mv | AT shawjim theoryoflocalkmerselectionwithapplicationstolongreadalignment AT yuyunwilliam theoryoflocalkmerselectionwithapplicationstolongreadalignment |