Cargando…
Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters
Using a sequence's k-mer content rather than the full sequence directly has enabled significant performance improvements in several sequencing applications, such as metagenomic species identification, estimation of transcript abundances, and alignment-free comparison of sequencing data. As k-me...
Autores principales: | , , |
---|---|
Formato: | Online Artículo Texto |
Lenguaje: | English |
Publicado: |
Mary Ann Liebert, Inc.
2017
|
Materias: | |
Acceso en línea: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5467106/ https://www.ncbi.nlm.nih.gov/pubmed/27828710 http://dx.doi.org/10.1089/cmb.2016.0155 |
_version_ | 1783243212294455296 |
---|---|
author | Pellow, David Filippova, Darya Kingsford, Carl |
author_facet | Pellow, David Filippova, Darya Kingsford, Carl |
author_sort | Pellow, David |
collection | PubMed |
description | Using a sequence's k-mer content rather than the full sequence directly has enabled significant performance improvements in several sequencing applications, such as metagenomic species identification, estimation of transcript abundances, and alignment-free comparison of sequencing data. As k-mer sets often reach hundreds of millions of elements, traditional data structures are often impractical for k-mer set storage, and Bloom filters (BFs) and their variants are used instead. BFs reduce the memory footprint required to store millions of k-mers while allowing for fast set containment queries, at the cost of a low false positive rate (FPR). We show that, because k-mers are derived from sequencing reads, the information about k-mer overlap in the original sequence can be used to reduce the FPR up to 30 × with little or no additional memory and with set containment queries that are only 1.3 – 1.6 times slower. Alternatively, we can leverage k-mer overlap information to store k-mer sets in about half the space while maintaining the original FPR. We consider several variants of such k-mer Bloom filters (kBFs), derive theoretical upper bounds for their FPR, and discuss their range of applications and limitations. |
format | Online Article Text |
id | pubmed-5467106 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2017 |
publisher | Mary Ann Liebert, Inc. |
record_format | MEDLINE/PubMed |
spelling | pubmed-54671062017-06-14 Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters Pellow, David Filippova, Darya Kingsford, Carl J Comput Biol Research Articles Using a sequence's k-mer content rather than the full sequence directly has enabled significant performance improvements in several sequencing applications, such as metagenomic species identification, estimation of transcript abundances, and alignment-free comparison of sequencing data. As k-mer sets often reach hundreds of millions of elements, traditional data structures are often impractical for k-mer set storage, and Bloom filters (BFs) and their variants are used instead. BFs reduce the memory footprint required to store millions of k-mers while allowing for fast set containment queries, at the cost of a low false positive rate (FPR). We show that, because k-mers are derived from sequencing reads, the information about k-mer overlap in the original sequence can be used to reduce the FPR up to 30 × with little or no additional memory and with set containment queries that are only 1.3 – 1.6 times slower. Alternatively, we can leverage k-mer overlap information to store k-mer sets in about half the space while maintaining the original FPR. We consider several variants of such k-mer Bloom filters (kBFs), derive theoretical upper bounds for their FPR, and discuss their range of applications and limitations. Mary Ann Liebert, Inc. 2017-06-01 2017-06-01 /pmc/articles/PMC5467106/ /pubmed/27828710 http://dx.doi.org/10.1089/cmb.2016.0155 Text en © David Pellow, et al., 2016. Published by Mary Ann Liebert, Inc. This Open Access article is distributed under the terms of the Creative Commons License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. |
spellingShingle | Research Articles Pellow, David Filippova, Darya Kingsford, Carl Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters |
title | Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters |
title_full | Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters |
title_fullStr | Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters |
title_full_unstemmed | Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters |
title_short | Improving Bloom Filter Performance on Sequence Data Using k-mer Bloom Filters |
title_sort | improving bloom filter performance on sequence data using k-mer bloom filters |
topic | Research Articles |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5467106/ https://www.ncbi.nlm.nih.gov/pubmed/27828710 http://dx.doi.org/10.1089/cmb.2016.0155 |
work_keys_str_mv | AT pellowdavid improvingbloomfilterperformanceonsequencedatausingkmerbloomfilters AT filippovadarya improvingbloomfilterperformanceonsequencedatausingkmerbloomfilters AT kingsfordcarl improvingbloomfilterperformanceonsequencedatausingkmerbloomfilters |