Cargando…

ntCard: a streaming algorithm for cardinality estimation in genomics data

MOTIVATION: Many bioinformatics algorithms are designed for the analysis of sequences of some uniform length, conventionally referred to as k-mers. These include de Bruijn graph assembly methods and sequence alignment tools. An efficient algorithm to enumerate the number of unique k-mers, or even be...

Descripción completa

Detalles Bibliográficos
Autores principales: Mohamadi, Hamid, Khan, Hamza, Birol, Inanc
Formato: Online Artículo Texto
Lenguaje:English
Publicado: Oxford University Press 2017
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5408799/
https://www.ncbi.nlm.nih.gov/pubmed/28453674
http://dx.doi.org/10.1093/bioinformatics/btw832
_version_ 1783232367129788416
author Mohamadi, Hamid
Khan, Hamza
Birol, Inanc
author_facet Mohamadi, Hamid
Khan, Hamza
Birol, Inanc
author_sort Mohamadi, Hamid
collection PubMed
description MOTIVATION: Many bioinformatics algorithms are designed for the analysis of sequences of some uniform length, conventionally referred to as k-mers. These include de Bruijn graph assembly methods and sequence alignment tools. An efficient algorithm to enumerate the number of unique k-mers, or even better, to build a histogram of k-mer frequencies would be desirable for these tools and their downstream analysis pipelines. Among other applications, estimated frequencies can be used to predict genome sizes, measure sequencing error rates, and tune runtime parameters for analysis tools. However, calculating a k-mer histogram from large volumes of sequencing data is a challenging task. RESULTS: Here, we present ntCard, a streaming algorithm for estimating the frequencies of k-mers in genomics datasets. At its core, ntCard uses the ntHash algorithm to efficiently compute hash values for streamed sequences. It then samples the calculated hash values to build a reduced representation multiplicity table describing the sample distribution. Finally, it uses a statistical model to reconstruct the population distribution from the sample distribution. We have compared the performance of ntCard and other cardinality estimation algorithms. We used three datasets of 480 GB, 500 GB and 2.4 TB in size, where the first two representing whole genome shotgun sequencing experiments on the human genome and the last one on the white spruce genome. Results show ntCard estimates k-mer coverage frequencies >15× faster than the state-of-the-art algorithms, using similar amount of memory, and with higher accuracy rates. Thus, our benchmarks demonstrate ntCard as a potentially enabling technology for large-scale genomics applications. AVAILABILITY AND IMPLEMENTATION: ntCard is written in C ++ and is released under the GPL license. It is freely available at https://github.com/bcgsc/ntCard. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
format Online
Article
Text
id pubmed-5408799
institution National Center for Biotechnology Information
language English
publishDate 2017
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-54087992017-05-03 ntCard: a streaming algorithm for cardinality estimation in genomics data Mohamadi, Hamid Khan, Hamza Birol, Inanc Bioinformatics Original Papers MOTIVATION: Many bioinformatics algorithms are designed for the analysis of sequences of some uniform length, conventionally referred to as k-mers. These include de Bruijn graph assembly methods and sequence alignment tools. An efficient algorithm to enumerate the number of unique k-mers, or even better, to build a histogram of k-mer frequencies would be desirable for these tools and their downstream analysis pipelines. Among other applications, estimated frequencies can be used to predict genome sizes, measure sequencing error rates, and tune runtime parameters for analysis tools. However, calculating a k-mer histogram from large volumes of sequencing data is a challenging task. RESULTS: Here, we present ntCard, a streaming algorithm for estimating the frequencies of k-mers in genomics datasets. At its core, ntCard uses the ntHash algorithm to efficiently compute hash values for streamed sequences. It then samples the calculated hash values to build a reduced representation multiplicity table describing the sample distribution. Finally, it uses a statistical model to reconstruct the population distribution from the sample distribution. We have compared the performance of ntCard and other cardinality estimation algorithms. We used three datasets of 480 GB, 500 GB and 2.4 TB in size, where the first two representing whole genome shotgun sequencing experiments on the human genome and the last one on the white spruce genome. Results show ntCard estimates k-mer coverage frequencies >15× faster than the state-of-the-art algorithms, using similar amount of memory, and with higher accuracy rates. Thus, our benchmarks demonstrate ntCard as a potentially enabling technology for large-scale genomics applications. AVAILABILITY AND IMPLEMENTATION: ntCard is written in C ++ and is released under the GPL license. It is freely available at https://github.com/bcgsc/ntCard. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online. Oxford University Press 2017-05-01 2017-01-05 /pmc/articles/PMC5408799/ /pubmed/28453674 http://dx.doi.org/10.1093/bioinformatics/btw832 Text en © The Author 2017. Published by Oxford University Press. http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
spellingShingle Original Papers
Mohamadi, Hamid
Khan, Hamza
Birol, Inanc
ntCard: a streaming algorithm for cardinality estimation in genomics data
title ntCard: a streaming algorithm for cardinality estimation in genomics data
title_full ntCard: a streaming algorithm for cardinality estimation in genomics data
title_fullStr ntCard: a streaming algorithm for cardinality estimation in genomics data
title_full_unstemmed ntCard: a streaming algorithm for cardinality estimation in genomics data
title_short ntCard: a streaming algorithm for cardinality estimation in genomics data
title_sort ntcard: a streaming algorithm for cardinality estimation in genomics data
topic Original Papers
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5408799/
https://www.ncbi.nlm.nih.gov/pubmed/28453674
http://dx.doi.org/10.1093/bioinformatics/btw832
work_keys_str_mv AT mohamadihamid ntcardastreamingalgorithmforcardinalityestimationingenomicsdata
AT khanhamza ntcardastreamingalgorithmforcardinalityestimationingenomicsdata
AT birolinanc ntcardastreamingalgorithmforcardinalityestimationingenomicsdata