Cargando…

Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters

Using a sequence's k-mer content rather than the full sequence directly has enabled significant performance improvements in several sequencing applications, such as metagenomic species identification, estimation of transcript abundances, and alignment-free comparison of sequencing data. As k-me...

Descripción completa

Detalles Bibliográficos
Autores principales: Pellow, David, Filippova, Darya, Kingsford, Carl
Formato: Online Artículo Texto
Lenguaje:English
Publicado: Mary Ann Liebert, Inc. 2017
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5467106/
https://www.ncbi.nlm.nih.gov/pubmed/27828710
http://dx.doi.org/10.1089/cmb.2016.0155
_version_ 1783243212294455296
author Pellow, David
Filippova, Darya
Kingsford, Carl
author_facet Pellow, David
Filippova, Darya
Kingsford, Carl
author_sort Pellow, David
collection PubMed
description Using a sequence's k-mer content rather than the full sequence directly has enabled significant performance improvements in several sequencing applications, such as metagenomic species identification, estimation of transcript abundances, and alignment-free comparison of sequencing data. As k-mer sets often reach hundreds of millions of elements, traditional data structures are often impractical for k-mer set storage, and Bloom filters (BFs) and their variants are used instead. BFs reduce the memory footprint required to store millions of k-mers while allowing for fast set containment queries, at the cost of a low false positive rate (FPR). We show that, because k-mers are derived from sequencing reads, the information about k-mer overlap in the original sequence can be used to reduce the FPR up to 30 × with little or no additional memory and with set containment queries that are only 1.3 – 1.6 times slower. Alternatively, we can leverage k-mer overlap information to store k-mer sets in about half the space while maintaining the original FPR. We consider several variants of such k-mer Bloom filters (kBFs), derive theoretical upper bounds for their FPR, and discuss their range of applications and limitations.
format Online
Article
Text
id pubmed-5467106
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher Mary Ann Liebert, Inc.
record_format MEDLINE/PubMed
spelling pubmed-54671062017-06-14 Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters Pellow, David Filippova, Darya Kingsford, Carl J Comput Biol Research Articles Using a sequence's k-mer content rather than the full sequence directly has enabled significant performance improvements in several sequencing applications, such as metagenomic species identification, estimation of transcript abundances, and alignment-free comparison of sequencing data. As k-mer sets often reach hundreds of millions of elements, traditional data structures are often impractical for k-mer set storage, and Bloom filters (BFs) and their variants are used instead. BFs reduce the memory footprint required to store millions of k-mers while allowing for fast set containment queries, at the cost of a low false positive rate (FPR). We show that, because k-mers are derived from sequencing reads, the information about k-mer overlap in the original sequence can be used to reduce the FPR up to 30 × with little or no additional memory and with set containment queries that are only 1.3 – 1.6 times slower. Alternatively, we can leverage k-mer overlap information to store k-mer sets in about half the space while maintaining the original FPR. We consider several variants of such k-mer Bloom filters (kBFs), derive theoretical upper bounds for their FPR, and discuss their range of applications and limitations. Mary Ann Liebert, Inc. 2017-06-01 2017-06-01 /pmc/articles/PMC5467106/ /pubmed/27828710 http://dx.doi.org/10.1089/cmb.2016.0155 Text en © David Pellow, et al., 2016. Published by Mary Ann Liebert, Inc. This Open Access article is distributed under the terms of the Creative Commons License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited.
spellingShingle Research Articles
Pellow, David
Filippova, Darya
Kingsford, Carl
Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters
title Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters
title_full Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters
title_fullStr Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters
title_full_unstemmed Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters
title_short Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters
title_sort improving bloom filter performance on sequence data using k-mer bloom filters
topic Research Articles
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5467106/
https://www.ncbi.nlm.nih.gov/pubmed/27828710
http://dx.doi.org/10.1089/cmb.2016.0155
work_keys_str_mv AT pellowdavid improvingbloomfilterperformanceonsequencedatausingkmerbloomfilters
AT filippovadarya improvingbloomfilterperformanceonsequencedatausingkmerbloomfilters
AT kingsfordcarl improvingbloomfilterperformanceonsequencedatausingkmerbloomfilters