Cargando…
Efficient counting of k-mers in DNA sequences using a bloom filter
BACKGROUND: Counting k-mers (substrings of length k in DNA sequence data) is an essential component of many methods in bioinformatics, including for genome and transcriptome assembly, for metagenomic sequencing, and for error correction of sequence reads. Although simple in principle, counting k-mer...
Autores principales: | , |
---|---|
Formato: | Online Artículo Texto |
Lenguaje: | English |
Publicado: |
BioMed Central
2011
|
Materias: | |
Acceso en línea: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3166945/ https://www.ncbi.nlm.nih.gov/pubmed/21831268 http://dx.doi.org/10.1186/1471-2105-12-333 |
_version_ | 1782211211750801408 |
---|---|
author | Melsted, Páll Pritchard, Jonathan K |
author_facet | Melsted, Páll Pritchard, Jonathan K |
author_sort | Melsted, Páll |
collection | PubMed |
description | BACKGROUND: Counting k-mers (substrings of length k in DNA sequence data) is an essential component of many methods in bioinformatics, including for genome and transcriptome assembly, for metagenomic sequencing, and for error correction of sequence reads. Although simple in principle, counting k-mers in large modern sequence data sets can easily overwhelm the memory capacity of standard computers. In current data sets, a large fraction-often more than 50%-of the storage capacity may be spent on storing k-mers that contain sequencing errors and which are typically observed only a single time in the data. These singleton k-mers are uninformative for many algorithms without some kind of error correction. RESULTS: We present a new method that identifies all the k-mers that occur more than once in a DNA sequence data set. Our method does this using a Bloom filter, a probabilistic data structure that stores all the observed k-mers implicitly in memory with greatly reduced memory requirements. We then make a second sweep through the data to provide exact counts of all nonunique k-mers. For example data sets, we report up to 50% savings in memory usage compared to current software, with modest costs in computational speed. This approach may reduce memory requirements for any algorithm that starts by counting k-mers in sequence data with errors. CONCLUSIONS: A reference implementation for this methodology, BFCounter, is written in C++ and is GPL licensed. It is available for free download at http://pritch.bsd.uchicago.edu/bfcounter.html |
format | Online Article Text |
id | pubmed-3166945 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2011 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-31669452011-09-06 Efficient counting of k-mers in DNA sequences using a bloom filter Melsted, Páll Pritchard, Jonathan K BMC Bioinformatics Methodology Article BACKGROUND: Counting k-mers (substrings of length k in DNA sequence data) is an essential component of many methods in bioinformatics, including for genome and transcriptome assembly, for metagenomic sequencing, and for error correction of sequence reads. Although simple in principle, counting k-mers in large modern sequence data sets can easily overwhelm the memory capacity of standard computers. In current data sets, a large fraction-often more than 50%-of the storage capacity may be spent on storing k-mers that contain sequencing errors and which are typically observed only a single time in the data. These singleton k-mers are uninformative for many algorithms without some kind of error correction. RESULTS: We present a new method that identifies all the k-mers that occur more than once in a DNA sequence data set. Our method does this using a Bloom filter, a probabilistic data structure that stores all the observed k-mers implicitly in memory with greatly reduced memory requirements. We then make a second sweep through the data to provide exact counts of all nonunique k-mers. For example data sets, we report up to 50% savings in memory usage compared to current software, with modest costs in computational speed. This approach may reduce memory requirements for any algorithm that starts by counting k-mers in sequence data with errors. CONCLUSIONS: A reference implementation for this methodology, BFCounter, is written in C++ and is GPL licensed. It is available for free download at http://pritch.bsd.uchicago.edu/bfcounter.html BioMed Central 2011-08-10 /pmc/articles/PMC3166945/ /pubmed/21831268 http://dx.doi.org/10.1186/1471-2105-12-333 Text en Copyright ©2011 Melsted and Pritchard; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Methodology Article Melsted, Páll Pritchard, Jonathan K Efficient counting of k-mers in DNA sequences using a bloom filter |
title | Efficient counting of k-mers in DNA sequences using a bloom filter |
title_full | Efficient counting of k-mers in DNA sequences using a bloom filter |
title_fullStr | Efficient counting of k-mers in DNA sequences using a bloom filter |
title_full_unstemmed | Efficient counting of k-mers in DNA sequences using a bloom filter |
title_short | Efficient counting of k-mers in DNA sequences using a bloom filter |
title_sort | efficient counting of k-mers in dna sequences using a bloom filter |
topic | Methodology Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3166945/ https://www.ncbi.nlm.nih.gov/pubmed/21831268 http://dx.doi.org/10.1186/1471-2105-12-333 |
work_keys_str_mv | AT melstedpall efficientcountingofkmersindnasequencesusingabloomfilter AT pritchardjonathank efficientcountingofkmersindnasequencesusingabloomfilter |