Comparing fixed sampling with minimizer sampling when using k-mer indexes to find maximal exact matches

Bioinformatics applications and pipelines increasingly use k-mer indexes to search for similar sequences. The major problem with k-mer indexes is that they require lots of memory. Sampling is often used to reduce index size and query time. Most applications use one of two major types of sampling: fi...

Bibliographic Details
Main authors:	Almutairy, Meznah; Torng, Eric
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2018
Subjects:	Research Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5794061/ https://www.ncbi.nlm.nih.gov/pubmed/29389989 http://dx.doi.org/10.1371/journal.pone.0189960

collection	PubMed
description	Bioinformatics applications and pipelines increasingly use k-mer indexes to search for similar sequences. The major problem with k-mer indexes is that they require lots of memory. Sampling is often used to reduce index size and query time. Most applications use one of two major types of sampling: fixed sampling and minimizer sampling. It is well known that fixed sampling will produce a smaller index, typically by roughly a factor of two, whereas it is generally assumed that minimizer sampling will produce faster query times since query k-mers can also be sampled. However, no direct comparison of fixed and minimizer sampling has been performed to verify these assumptions. We systematically compare fixed and minimizer sampling using the human genome as our database. We use the resulting k-mer indexes for fixed sampling and minimizer sampling to find all maximal exact matches between our database, the human genome, and three separate query sets, the mouse genome, the chimp genome, and an NGS data set. We reach the following conclusions. First, using larger k-mers reduces query time for both fixed sampling and minimizer sampling at a cost of requiring more space. If we use the same k-mer size for both methods, fixed sampling requires typically half as much space whereas minimizer sampling processes queries only slightly faster. If we are allowed to use any k-mer size for each method, then we can choose a k-mer size such that fixed sampling both uses less space and processes queries faster than minimizer sampling. The reason is that although minimizer sampling is able to sample query k-mers, the number of shared k-mer occurrences that must be processed is much larger for minimizer sampling than fixed sampling. In conclusion, we argue that for any application where each shared k-mer occurrence must be processed, fixed sampling is the right sampling method.
id	pubmed-5794061
institution	National Center for Biotechnology Information
record_format	MEDLINE/PubMed
spelling	pubmed-5794061 2018-02-09 Almutairy, Meznah; Torng, Eric. PLoS One. Research Article. Public Library of Science 2018-02-01 /pmc/articles/PMC5794061/ /pubmed/29389989 http://dx.doi.org/10.1371/journal.pone.0189960 Text en © 2018 Almutairy, Torng http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
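
The description above contrasts two ways of choosing which database k-mers go into the index. The following is a minimal illustrative sketch of both schemes, not the authors' implementation. The function names, the parameter w (sampling step or window size), the lexicographic ordering used to pick minimizers, and the toy sequences are all assumptions made for this example. With fixed sampling every k-mer of the query has to be looked up, whereas with minimizer sampling the query can be sampled the same way as the database.

```python
from collections import defaultdict


def fixed_sample(seq, k, w):
    """Fixed sampling: keep the k-mer starting at every w-th position (0, w, 2w, ...)."""
    return [(i, seq[i:i + k]) for i in range(0, len(seq) - k + 1, w)]


def minimizer_sample(seq, k, w):
    """Minimizer sampling: in every window of w consecutive k-mers, keep the
    smallest k-mer. Lexicographic order with leftmost tie-breaking is an
    assumption of this sketch; real tools often use a hash-based order."""
    n = len(seq) - k + 1  # number of k-mers in seq
    if n <= 0:
        return []
    picked = set()
    # If the sequence holds fewer than w k-mers, treat it as one short window.
    for start in range(max(n - w + 1, 1)):
        end = min(start + w, n)
        best = min(range(start, end), key=lambda i: (seq[i:i + k], i))
        picked.add(best)
    return [(i, seq[i:i + k]) for i in sorted(picked)]


def build_index(samples):
    """Map each sampled k-mer to the list of database positions where it was sampled."""
    index = defaultdict(list)
    for pos, kmer in samples:
        index[kmer].append(pos)
    return index


def shared_occurrences(index, query_samples):
    """Count (query position, database position) pairs that share a k-mer.
    Each such pair is a seed that a match finder would have to process."""
    return sum(len(index.get(kmer, [])) for _, kmer in query_samples)


# Toy data (hypothetical, chosen only to keep the example small).
database = "TTGCATGACC"
query = "GCATGA"
k, w = 3, 3

fixed_index = build_index(fixed_sample(database, k, w))
minimizer_index = build_index(minimizer_sample(database, k, w))

# Fixed sampling: the query is not sampled, so every query k-mer is looked up (step 1).
fixed_hits = shared_occurrences(fixed_index, fixed_sample(query, k, 1))
# Minimizer sampling: the query is sampled with the same (w, k) rule as the database.
minimizer_hits = shared_occurrences(minimizer_index, minimizer_sample(query, k, w))

print("sampled database k-mers:", len(fixed_sample(database, k, w)),
      "fixed vs", len(minimizer_sample(database, k, w)), "minimizer")
print("shared k-mer occurrences:", fixed_hits, "fixed vs", minimizer_hits, "minimizer")
```

On sequences this short the counts say little about real genomes. The sketch is only meant to make the two sampling rules concrete, together with the notion of a shared k-mer occurrence that the description identifies as the dominant query cost.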
