Cargando…

Efficient counting of k-mers in DNA sequences using a bloom filter

BACKGROUND: Counting k-mers (substrings of length k in DNA sequence data) is an essential component of many methods in bioinformatics, including for genome and transcriptome assembly, for metagenomic sequencing, and for error correction of sequence reads. Although simple in principle, counting k-mer...

Descripción completa

Detalles Bibliográficos
Autores principales: Melsted, Páll, Pritchard, Jonathan K
Formato: Online Artículo Texto
Lenguaje:English
Publicado: BioMed Central 2011
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3166945/
https://www.ncbi.nlm.nih.gov/pubmed/21831268
http://dx.doi.org/10.1186/1471-2105-12-333
_version_ 1782211211750801408
author Melsted, Páll
Pritchard, Jonathan K
author_facet Melsted, Páll
Pritchard, Jonathan K
author_sort Melsted, Páll
collection PubMed
description BACKGROUND: Counting k-mers (substrings of length k in DNA sequence data) is an essential component of many methods in bioinformatics, including for genome and transcriptome assembly, for metagenomic sequencing, and for error correction of sequence reads. Although simple in principle, counting k-mers in large modern sequence data sets can easily overwhelm the memory capacity of standard computers. In current data sets, a large fraction-often more than 50%-of the storage capacity may be spent on storing k-mers that contain sequencing errors and which are typically observed only a single time in the data. These singleton k-mers are uninformative for many algorithms without some kind of error correction. RESULTS: We present a new method that identifies all the k-mers that occur more than once in a DNA sequence data set. Our method does this using a Bloom filter, a probabilistic data structure that stores all the observed k-mers implicitly in memory with greatly reduced memory requirements. We then make a second sweep through the data to provide exact counts of all nonunique k-mers. For example data sets, we report up to 50% savings in memory usage compared to current software, with modest costs in computational speed. This approach may reduce memory requirements for any algorithm that starts by counting k-mers in sequence data with errors. CONCLUSIONS: A reference implementation for this methodology, BFCounter, is written in C++ and is GPL licensed. It is available for free download at http://pritch.bsd.uchicago.edu/bfcounter.html
format Online
Article
Text
id pubmed-3166945
institution National Center for Biotechnology Information
language English
publishDate 2011
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-31669452011-09-06 Efficient counting of k-mers in DNA sequences using a bloom filter Melsted, Páll Pritchard, Jonathan K BMC Bioinformatics Methodology Article BACKGROUND: Counting k-mers (substrings of length k in DNA sequence data) is an essential component of many methods in bioinformatics, including for genome and transcriptome assembly, for metagenomic sequencing, and for error correction of sequence reads. Although simple in principle, counting k-mers in large modern sequence data sets can easily overwhelm the memory capacity of standard computers. In current data sets, a large fraction-often more than 50%-of the storage capacity may be spent on storing k-mers that contain sequencing errors and which are typically observed only a single time in the data. These singleton k-mers are uninformative for many algorithms without some kind of error correction. RESULTS: We present a new method that identifies all the k-mers that occur more than once in a DNA sequence data set. Our method does this using a Bloom filter, a probabilistic data structure that stores all the observed k-mers implicitly in memory with greatly reduced memory requirements. We then make a second sweep through the data to provide exact counts of all nonunique k-mers. For example data sets, we report up to 50% savings in memory usage compared to current software, with modest costs in computational speed. This approach may reduce memory requirements for any algorithm that starts by counting k-mers in sequence data with errors. CONCLUSIONS: A reference implementation for this methodology, BFCounter, is written in C++ and is GPL licensed. It is available for free download at http://pritch.bsd.uchicago.edu/bfcounter.html BioMed Central 2011-08-10 /pmc/articles/PMC3166945/ /pubmed/21831268 http://dx.doi.org/10.1186/1471-2105-12-333 Text en Copyright ©2011 Melsted and Pritchard; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle Methodology Article
Melsted, Páll
Pritchard, Jonathan K
Efficient counting of k-mers in DNA sequences using a bloom filter
title Efficient counting of k-mers in DNA sequences using a bloom filter
title_full Efficient counting of k-mers in DNA sequences using a bloom filter
title_fullStr Efficient counting of k-mers in DNA sequences using a bloom filter
title_full_unstemmed Efficient counting of k-mers in DNA sequences using a bloom filter
title_short Efficient counting of k-mers in DNA sequences using a bloom filter
title_sort efficient counting of k-mers in dna sequences using a bloom filter
topic Methodology Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3166945/
https://www.ncbi.nlm.nih.gov/pubmed/21831268
http://dx.doi.org/10.1186/1471-2105-12-333
work_keys_str_mv AT melstedpall efficientcountingofkmersindnasequencesusingabloomfilter
AT pritchardjonathank efficientcountingofkmersindnasequencesusingabloomfilter