Cargando…

Improving the performance of minimizers and winnowing schemes

MOTIVATION: The minimizers scheme is a method for selecting k-mers from sequences. It is used in many bioinformatics software tools to bin comparable sequences or to sample a sequence in a deterministic fashion at approximately regular intervals, in order to reduce memory consumption and processing...

Descripción completa

Detalles Bibliográficos
Autores principales: Marçais, Guillaume, Pellow, David, Bork, Daniel, Orenstein, Yaron, Shamir, Ron, Kingsford, Carl
Formato: Online Artículo Texto
Lenguaje:English
Publicado: Oxford University Press 2017
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5870760/
https://www.ncbi.nlm.nih.gov/pubmed/28881970
http://dx.doi.org/10.1093/bioinformatics/btx235
_version_ 1783309545201729536
author Marçais, Guillaume
Pellow, David
Bork, Daniel
Orenstein, Yaron
Shamir, Ron
Kingsford, Carl
author_facet Marçais, Guillaume
Pellow, David
Bork, Daniel
Orenstein, Yaron
Shamir, Ron
Kingsford, Carl
author_sort Marçais, Guillaume
collection PubMed
description MOTIVATION: The minimizers scheme is a method for selecting k-mers from sequences. It is used in many bioinformatics software tools to bin comparable sequences or to sample a sequence in a deterministic fashion at approximately regular intervals, in order to reduce memory consumption and processing time. Although very useful, the minimizers selection procedure has undesirable behaviors (e.g. too many k-mers are selected when processing certain sequences). Some of these problems were already known to the authors of the minimizers technique, and the natural lexicographic ordering of k-mers used by minimizers was recognized as their origin. Many software tools using minimizers employ ad hoc variations of the lexicographic order to alleviate those issues. RESULTS: We provide an in-depth analysis of the effect of k-mer ordering on the performance of the minimizers technique. By using small universal hitting sets (a recently defined concept), we show how to significantly improve the performance of minimizers and avoid some of its worse behaviors. Based on these results, we encourage bioinformatics software developers to use an ordering based on a universal hitting set or, if not possible, a randomized ordering, rather than the lexicographic order. This analysis also settles negatively a conjecture (by Schleimer et al.) on the expected density of minimizers in a random sequence. AVAILABILITY AND IMPLEMENTATION: The software used for this analysis is available on GitHub: https://github.com/gmarcais/minimizers.git.
format Online
Article
Text
id pubmed-5870760
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-58707602018-04-05 Improving the performance of minimizers and winnowing schemes Marçais, Guillaume Pellow, David Bork, Daniel Orenstein, Yaron Shamir, Ron Kingsford, Carl Bioinformatics Ismb/Eccb 2017: The 25th Annual Conference Intelligent Systems for Molecular Biology Held Jointly with the 16th Annual European Conference on Computational Biology, Prague, Czech Republic, July 21–25, 2017 MOTIVATION: The minimizers scheme is a method for selecting k-mers from sequences. It is used in many bioinformatics software tools to bin comparable sequences or to sample a sequence in a deterministic fashion at approximately regular intervals, in order to reduce memory consumption and processing time. Although very useful, the minimizers selection procedure has undesirable behaviors (e.g. too many k-mers are selected when processing certain sequences). Some of these problems were already known to the authors of the minimizers technique, and the natural lexicographic ordering of k-mers used by minimizers was recognized as their origin. Many software tools using minimizers employ ad hoc variations of the lexicographic order to alleviate those issues. RESULTS: We provide an in-depth analysis of the effect of k-mer ordering on the performance of the minimizers technique. By using small universal hitting sets (a recently defined concept), we show how to significantly improve the performance of minimizers and avoid some of its worse behaviors. Based on these results, we encourage bioinformatics software developers to use an ordering based on a universal hitting set or, if not possible, a randomized ordering, rather than the lexicographic order. This analysis also settles negatively a conjecture (by Schleimer et al.) on the expected density of minimizers in a random sequence. AVAILABILITY AND IMPLEMENTATION: The software used for this analysis is available on GitHub: https://github.com/gmarcais/minimizers.git. Oxford University Press 2017-07-15 2017-07-12 /pmc/articles/PMC5870760/ /pubmed/28881970 http://dx.doi.org/10.1093/bioinformatics/btx235 Text en © The Author 2017. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
spellingShingle Ismb/Eccb 2017: The 25th Annual Conference Intelligent Systems for Molecular Biology Held Jointly with the 16th Annual European Conference on Computational Biology, Prague, Czech Republic, July 21–25, 2017
Marçais, Guillaume
Pellow, David
Bork, Daniel
Orenstein, Yaron
Shamir, Ron
Kingsford, Carl
Improving the performance of minimizers and winnowing schemes
title Improving the performance of minimizers and winnowing schemes
title_full Improving the performance of minimizers and winnowing schemes
title_fullStr Improving the performance of minimizers and winnowing schemes
title_full_unstemmed Improving the performance of minimizers and winnowing schemes
title_short Improving the performance of minimizers and winnowing schemes
title_sort improving the performance of minimizers and winnowing schemes
topic Ismb/Eccb 2017: The 25th Annual Conference Intelligent Systems for Molecular Biology Held Jointly with the 16th Annual European Conference on Computational Biology, Prague, Czech Republic, July 21–25, 2017
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5870760/
https://www.ncbi.nlm.nih.gov/pubmed/28881970
http://dx.doi.org/10.1093/bioinformatics/btx235
work_keys_str_mv AT marcaisguillaume improvingtheperformanceofminimizersandwinnowingschemes
AT pellowdavid improvingtheperformanceofminimizersandwinnowingschemes
AT borkdaniel improvingtheperformanceofminimizersandwinnowingschemes
AT orensteinyaron improvingtheperformanceofminimizersandwinnowingschemes
AT shamirron improvingtheperformanceofminimizersandwinnowingschemes
AT kingsfordcarl improvingtheperformanceofminimizersandwinnowingschemes