Cargando…

Designing small universal k-mer hitting sets for improved analysis of high-throughput sequencing

With the rapidly increasing volume of deep sequencing data, more efficient algorithms and data structures are needed. Minimizers are a central recent paradigm that has improved various sequence analysis tasks, including hashing for faster read overlap detection, sparse suffix arrays for creating sma...

Descripción completa

Detalles Bibliográficos
Autores principales:	Orenstein, Yaron, Pellow, David, Marçais, Guillaume, Shamir, Ron, Kingsford, Carl
Formato:	Online Artículo Texto
Lenguaje:	English
Publicado:	Public Library of Science 2017
Materias:	Research Article
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5645146/ https://www.ncbi.nlm.nih.gov/pubmed/28968408 http://dx.doi.org/10.1371/journal.pcbi.1005777

_version_	1783271844215783424
author	Orenstein, Yaron Pellow, David Marçais, Guillaume Shamir, Ron Kingsford, Carl
author_facet	Orenstein, Yaron Pellow, David Marçais, Guillaume Shamir, Ron Kingsford, Carl
author_sort	Orenstein, Yaron
collection	PubMed
description	With the rapidly increasing volume of deep sequencing data, more efficient algorithms and data structures are needed. Minimizers are a central recent paradigm that has improved various sequence analysis tasks, including hashing for faster read overlap detection, sparse suffix arrays for creating smaller indexes, and Bloom filters for speeding up sequence search. Here, we propose an alternative paradigm that can lead to substantial further improvement in these and other tasks. For integers k and L > k, we say that a set of k-mers is a universal hitting set (UHS) if every possible L-long sequence must contain a k-mer from the set. We develop a heuristic called DOCKS to find a compact UHS, which works in two phases: The first phase is solved optimally, and for the second we propose several efficient heuristics, trading set size for speed and memory. The use of heuristics is motivated by showing the NP-hardness of a closely related problem. We show that DOCKS works well in practice and produces UHSs that are very close to a theoretical lower bound. We present results for various values of k and L and by applying them to real genomes show that UHSs indeed improve over minimizers. In particular, DOCKS uses less than 30% of the 10-mers needed to span the human genome compared to minimizers. The software and computed UHSs are freely available at github.com/Shamir-Lab/DOCKS/ and acgt.cs.tau.ac.il/docks/, respectively.
format	Online Article Text
id	pubmed-5645146
institution	National Center for Biotechnology Information
language	English
publishDate	2017
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-56451462017-10-30 Designing small universal k-mer hitting sets for improved analysis of high-throughput sequencing Orenstein, Yaron Pellow, David Marçais, Guillaume Shamir, Ron Kingsford, Carl PLoS Comput Biol Research Article With the rapidly increasing volume of deep sequencing data, more efficient algorithms and data structures are needed. Minimizers are a central recent paradigm that has improved various sequence analysis tasks, including hashing for faster read overlap detection, sparse suffix arrays for creating smaller indexes, and Bloom filters for speeding up sequence search. Here, we propose an alternative paradigm that can lead to substantial further improvement in these and other tasks. For integers k and L > k, we say that a set of k-mers is a universal hitting set (UHS) if every possible L-long sequence must contain a k-mer from the set. We develop a heuristic called DOCKS to find a compact UHS, which works in two phases: The first phase is solved optimally, and for the second we propose several efficient heuristics, trading set size for speed and memory. The use of heuristics is motivated by showing the NP-hardness of a closely related problem. We show that DOCKS works well in practice and produces UHSs that are very close to a theoretical lower bound. We present results for various values of k and L and by applying them to real genomes show that UHSs indeed improve over minimizers. In particular, DOCKS uses less than 30% of the 10-mers needed to span the human genome compared to minimizers. The software and computed UHSs are freely available at github.com/Shamir-Lab/DOCKS/ and acgt.cs.tau.ac.il/docks/, respectively. Public Library of Science 2017-10-02 /pmc/articles/PMC5645146/ /pubmed/28968408 http://dx.doi.org/10.1371/journal.pcbi.1005777 Text en © 2017 Orenstein et al http://creativecommons.org/licenses/by/4.0/ This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/) , which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
spellingShingle	Research Article Orenstein, Yaron Pellow, David Marçais, Guillaume Shamir, Ron Kingsford, Carl Designing small universal k-mer hitting sets for improved analysis of high-throughput sequencing
title	Designing small universal k-mer hitting sets for improved analysis of high-throughput sequencing
title_full	Designing small universal k-mer hitting sets for improved analysis of high-throughput sequencing
title_fullStr	Designing small universal k-mer hitting sets for improved analysis of high-throughput sequencing
title_full_unstemmed	Designing small universal k-mer hitting sets for improved analysis of high-throughput sequencing
title_short	Designing small universal k-mer hitting sets for improved analysis of high-throughput sequencing
title_sort	designing small universal k-mer hitting sets for improved analysis of high-throughput sequencing
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5645146/ https://www.ncbi.nlm.nih.gov/pubmed/28968408 http://dx.doi.org/10.1371/journal.pcbi.1005777
work_keys_str_mv	AT orensteinyaron designingsmalluniversalkmerhittingsetsforimprovedanalysisofhighthroughputsequencing AT pellowdavid designingsmalluniversalkmerhittingsetsforimprovedanalysisofhighthroughputsequencing AT marcaisguillaume designingsmalluniversalkmerhittingsetsforimprovedanalysisofhighthroughputsequencing AT shamirron designingsmalluniversalkmerhittingsetsforimprovedanalysisofhighthroughputsequencing AT kingsfordcarl designingsmalluniversalkmerhittingsetsforimprovedanalysisofhighthroughputsequencing

Designing small universal k-mer hitting sets for improved analysis of high-throughput sequencing

Ejemplares similares