Cargando…

Efficient counting of k-mers in DNA sequences using a bloom filter

BACKGROUND: Counting k-mers (substrings of length k in DNA sequence data) is an essential component of many methods in bioinformatics, including for genome and transcriptome assembly, for metagenomic sequencing, and for error correction of sequence reads. Although simple in principle, counting k-mer...

Descripción completa

Detalles Bibliográficos
Autores principales:	Melsted, Páll, Pritchard, Jonathan K
Formato:	Online Artículo Texto
Lenguaje:	English
Publicado:	BioMed Central 2011
Materias:	Methodology Article
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3166945/ https://www.ncbi.nlm.nih.gov/pubmed/21831268 http://dx.doi.org/10.1186/1471-2105-12-333

_version_	1782211211750801408
author	Melsted, Páll Pritchard, Jonathan K
author_facet	Melsted, Páll Pritchard, Jonathan K
author_sort	Melsted, Páll
collection	PubMed
description	BACKGROUND: Counting k-mers (substrings of length k in DNA sequence data) is an essential component of many methods in bioinformatics, including for genome and transcriptome assembly, for metagenomic sequencing, and for error correction of sequence reads. Although simple in principle, counting k-mers in large modern sequence data sets can easily overwhelm the memory capacity of standard computers. In current data sets, a large fraction-often more than 50%-of the storage capacity may be spent on storing k-mers that contain sequencing errors and which are typically observed only a single time in the data. These singleton k-mers are uninformative for many algorithms without some kind of error correction. RESULTS: We present a new method that identifies all the k-mers that occur more than once in a DNA sequence data set. Our method does this using a Bloom filter, a probabilistic data structure that stores all the observed k-mers implicitly in memory with greatly reduced memory requirements. We then make a second sweep through the data to provide exact counts of all nonunique k-mers. For example data sets, we report up to 50% savings in memory usage compared to current software, with modest costs in computational speed. This approach may reduce memory requirements for any algorithm that starts by counting k-mers in sequence data with errors. CONCLUSIONS: A reference implementation for this methodology, BFCounter, is written in C++ and is GPL licensed. It is available for free download at http://pritch.bsd.uchicago.edu/bfcounter.html
format	Online Article Text
id	pubmed-3166945
institution	National Center for Biotechnology Information
language	English
publishDate	2011
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-31669452011-09-06 Efficient counting of k-mers in DNA sequences using a bloom filter Melsted, Páll Pritchard, Jonathan K BMC Bioinformatics Methodology Article BACKGROUND: Counting k-mers (substrings of length k in DNA sequence data) is an essential component of many methods in bioinformatics, including for genome and transcriptome assembly, for metagenomic sequencing, and for error correction of sequence reads. Although simple in principle, counting k-mers in large modern sequence data sets can easily overwhelm the memory capacity of standard computers. In current data sets, a large fraction-often more than 50%-of the storage capacity may be spent on storing k-mers that contain sequencing errors and which are typically observed only a single time in the data. These singleton k-mers are uninformative for many algorithms without some kind of error correction. RESULTS: We present a new method that identifies all the k-mers that occur more than once in a DNA sequence data set. Our method does this using a Bloom filter, a probabilistic data structure that stores all the observed k-mers implicitly in memory with greatly reduced memory requirements. We then make a second sweep through the data to provide exact counts of all nonunique k-mers. For example data sets, we report up to 50% savings in memory usage compared to current software, with modest costs in computational speed. This approach may reduce memory requirements for any algorithm that starts by counting k-mers in sequence data with errors. CONCLUSIONS: A reference implementation for this methodology, BFCounter, is written in C++ and is GPL licensed. It is available for free download at http://pritch.bsd.uchicago.edu/bfcounter.html BioMed Central 2011-08-10 /pmc/articles/PMC3166945/ /pubmed/21831268 http://dx.doi.org/10.1186/1471-2105-12-333 Text en Copyright ©2011 Melsted and Pritchard; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Methodology Article Melsted, Páll Pritchard, Jonathan K Efficient counting of k-mers in DNA sequences using a bloom filter
title	Efficient counting of k-mers in DNA sequences using a bloom filter
title_full	Efficient counting of k-mers in DNA sequences using a bloom filter
title_fullStr	Efficient counting of k-mers in DNA sequences using a bloom filter
title_full_unstemmed	Efficient counting of k-mers in DNA sequences using a bloom filter
title_short	Efficient counting of k-mers in DNA sequences using a bloom filter
title_sort	efficient counting of k-mers in dna sequences using a bloom filter
topic	Methodology Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3166945/ https://www.ncbi.nlm.nih.gov/pubmed/21831268 http://dx.doi.org/10.1186/1471-2105-12-333
work_keys_str_mv	AT melstedpall efficientcountingofkmersindnasequencesusingabloomfilter AT pritchardjonathank efficientcountingofkmersindnasequencesusingabloomfilter

Efficient counting of k-mers in DNA sequences using a bloom filter

Ejemplares similares