Cargando…

Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters

Using a sequence's k-mer content rather than the full sequence directly has enabled significant performance improvements in several sequencing applications, such as metagenomic species identification, estimation of transcript abundances, and alignment-free comparison of sequencing data. As k-me...

Descripción completa

Detalles Bibliográficos
Autores principales:	Pellow, David, Filippova, Darya, Kingsford, Carl
Formato:	Online Artículo Texto
Lenguaje:	English
Publicado:	Mary Ann Liebert, Inc. 2017
Materias:	Research Articles
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5467106/ https://www.ncbi.nlm.nih.gov/pubmed/27828710 http://dx.doi.org/10.1089/cmb.2016.0155

_version_	1783243212294455296
author	Pellow, David Filippova, Darya Kingsford, Carl
author_facet	Pellow, David Filippova, Darya Kingsford, Carl
author_sort	Pellow, David
collection	PubMed
description	Using a sequence's k-mer content rather than the full sequence directly has enabled significant performance improvements in several sequencing applications, such as metagenomic species identification, estimation of transcript abundances, and alignment-free comparison of sequencing data. As k-mer sets often reach hundreds of millions of elements, traditional data structures are often impractical for k-mer set storage, and Bloom filters (BFs) and their variants are used instead. BFs reduce the memory footprint required to store millions of k-mers while allowing for fast set containment queries, at the cost of a low false positive rate (FPR). We show that, because k-mers are derived from sequencing reads, the information about k-mer overlap in the original sequence can be used to reduce the FPR up to 30 × with little or no additional memory and with set containment queries that are only 1.3 – 1.6 times slower. Alternatively, we can leverage k-mer overlap information to store k-mer sets in about half the space while maintaining the original FPR. We consider several variants of such k-mer Bloom filters (kBFs), derive theoretical upper bounds for their FPR, and discuss their range of applications and limitations.
format	Online Article Text
id	pubmed-5467106
institution	National Center for Biotechnology Information
language	English
publishDate	2017
publisher	Mary Ann Liebert, Inc.
record_format	MEDLINE/PubMed
spelling	pubmed-54671062017-06-14 Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters Pellow, David Filippova, Darya Kingsford, Carl J Comput Biol Research Articles Using a sequence's k-mer content rather than the full sequence directly has enabled significant performance improvements in several sequencing applications, such as metagenomic species identification, estimation of transcript abundances, and alignment-free comparison of sequencing data. As k-mer sets often reach hundreds of millions of elements, traditional data structures are often impractical for k-mer set storage, and Bloom filters (BFs) and their variants are used instead. BFs reduce the memory footprint required to store millions of k-mers while allowing for fast set containment queries, at the cost of a low false positive rate (FPR). We show that, because k-mers are derived from sequencing reads, the information about k-mer overlap in the original sequence can be used to reduce the FPR up to 30 × with little or no additional memory and with set containment queries that are only 1.3 – 1.6 times slower. Alternatively, we can leverage k-mer overlap information to store k-mer sets in about half the space while maintaining the original FPR. We consider several variants of such k-mer Bloom filters (kBFs), derive theoretical upper bounds for their FPR, and discuss their range of applications and limitations. Mary Ann Liebert, Inc. 2017-06-01 2017-06-01 /pmc/articles/PMC5467106/ /pubmed/27828710 http://dx.doi.org/10.1089/cmb.2016.0155 Text en © David Pellow, et al., 2016. Published by Mary Ann Liebert, Inc. This Open Access article is distributed under the terms of the Creative Commons License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.
spellingShingle	Research Articles Pellow, David Filippova, Darya Kingsford, Carl Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters
title	Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters
title_full	Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters
title_fullStr	Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters
title_full_unstemmed	Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters
title_short	Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters
title_sort	improving bloom filter performance on sequence data using k-mer bloom filters
topic	Research Articles
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5467106/ https://www.ncbi.nlm.nih.gov/pubmed/27828710 http://dx.doi.org/10.1089/cmb.2016.0155
work_keys_str_mv	AT pellowdavid improvingbloomfilterperformanceonsequencedatausingkmerbloomfilters AT filippovadarya improvingbloomfilterperformanceonsequencedatausingkmerbloomfilters AT kingsfordcarl improvingbloomfilterperformanceonsequencedatausingkmerbloomfilters

Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters

Ejemplares similares