
Estimating the k-mer Coverage Frequencies in Genomic Datasets: A Comparative Assessment of the State-of-the-art

BACKGROUND: In bioinformatics, estimating k-mer abundance histograms, or simply enumerating the number of unique k-mers and the number of singletons, is desirable in many genome sequence analysis applications. The applications include predicting genome sizes, data pre-processing for de Bruijn graph...

Bibliographic Details
Main authors: Manekar, Swati C., Sathe, Shailesh R.
Format: Online Article Text
Language: English
Published: Bentham Science Publishers 2019
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6446480/
https://www.ncbi.nlm.nih.gov/pubmed/31015787
http://dx.doi.org/10.2174/1389202919666181026101326
collection PubMed
description BACKGROUND: In bioinformatics, estimating k-mer abundance histograms, or simply enumerating the number of unique k-mers and the number of singletons, is desirable in many genome sequence analysis applications. The applications include predicting genome sizes, data pre-processing for de Bruijn graph assembly methods (tuning runtime parameters for analysis tools), repeat detection, sequencing coverage estimation, measuring sequencing error rates, etc. Different methods for cardinality estimation in sequencing data have been developed in recent years. OBJECTIVE: In this article, we present a comparative assessment of the different k-mer frequency estimation programs (ntCard, KmerGenie, KmerStream and Khmer (abundance-dist-single.py and unique-kmers.py)) to assess their relative merits and demerits. METHODS: Principally, the miscounts/error rates of these tools are analyzed by rigorous experimental analysis for a varied range of k. We also present experimental results on the runtime, scalability for larger datasets, memory, CPU utilization and parallelism of k-mer frequency estimation methods. RESULTS: The results indicate that ntCard is more accurate in estimating F0, f1 and full k-mer abundance histograms than the other methods. ntCard is the fastest, but it requires more memory than KmerGenie. CONCLUSION: The results of this evaluation may serve as a roadmap for potential users and practitioners of streaming algorithms for estimating k-mer coverage frequencies, assisting them in identifying an appropriate method. Such analysis of the results also helps researchers discover remaining open research questions, effective combinations of existing techniques and possible avenues for future research.
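As an illustration of the quantities named in the abstract, the following minimal Python sketch counts k-mers exactly and derives the abundance histogram, F0 and f1 from the counts. It assumes that F0 denotes the number of distinct k-mers and f1 the number of k-mers seen exactly once (singletons). The function name kmer_histogram and the toy reads are hypothetical. This is not how the compared tools operate: they estimate these values, whereas an exact count like this one is only practical as a reference on small inputs. The sketch also ignores reverse complements.

```python
from collections import Counter


def kmer_histogram(reads, k):
    """Exactly count k-mers in `reads` and summarise the counts.

    Returns (hist, F0, f1), where hist maps an abundance to the number of
    distinct k-mers with that abundance, F0 is the number of distinct
    k-mers, and f1 is the number of singletons (assumed definitions).
    """
    counts = Counter()
    valid = set("ACGT")
    for read in reads:
        seq = read.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            # Skip k-mers containing ambiguous bases such as N.
            if set(kmer) <= valid:
                counts[kmer] += 1
    hist = Counter(counts.values())  # abundance -> number of k-mers
    F0 = len(counts)                 # distinct k-mers
    f1 = hist.get(1, 0)              # k-mers observed exactly once
    return dict(hist), F0, f1


# Toy input (hypothetical): ACGT occurs three times, four other 4-mers once.
hist, F0, f1 = kmer_histogram(["ACGTACGT", "ACGTT"], 4)
print(F0, f1)  # 5 4
```

Exact counting keeps every distinct k-mer in memory, which is the cost that estimation methods of the kind assessed in the article aim to avoid on large sequencing datasets.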
format Online
Article
Text
id pubmed-6446480
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Bentham Science Publishers
record_format MEDLINE/PubMed
spelling pubmed-6446480 2019-07-01 Curr Genomics Article Bentham Science Publishers 2019-01 /pmc/articles/PMC6446480/ /pubmed/31015787 http://dx.doi.org/10.2174/1389202919666181026101326 Text en © 2019 Bentham Science Publishers https://creativecommons.org/licenses/by-nc/4.0/legalcode This is an open access article licensed under the terms of the Creative Commons Attribution-Non-Commercial 4.0 International Public License (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/legalcode), which permits unrestricted, non-commercial use, distribution and reproduction in any medium, provided the work is properly cited.
topic Article