Estimating the k-mer Coverage Frequencies in Genomic Datasets: A Comparative Assessment of the State-of-the-art
BACKGROUND: In bioinformatics, estimation of k-mer abundance histograms or just enumerating the number of unique k-mers and the number of singletons is desirable in many genome sequence analysis applications. The applications include predicting genome sizes, data pre-processing for de Bruijn graph...
Main authors: | Manekar, Swati C., Sathe, Shailesh R. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Bentham Science Publishers 2019 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6446480/ https://www.ncbi.nlm.nih.gov/pubmed/31015787 http://dx.doi.org/10.2174/1389202919666181026101326 |
_version_ | 1783408371383140352 |
---|---|
author | Manekar, Swati C. Sathe, Shailesh R. |
author_facet | Manekar, Swati C. Sathe, Shailesh R. |
author_sort | Manekar, Swati C. |
collection | PubMed |
description | BACKGROUND: In bioinformatics, estimation of k-mer abundance histograms or just enumerating the number of unique k-mers and the number of singletons is desirable in many genome sequence analysis applications. The applications include predicting genome sizes, data pre-processing for de Bruijn graph assembly methods (tune runtime parameters for analysis tools), repeat detection, sequencing coverage estimation, measuring sequencing error rates, etc. Different methods for cardinality estimation in sequencing data have been developed in recent years. OBJECTIVE: In this article, we present a comparative assessment of the different k-mer frequency estimation programs (ntCard, KmerGenie, KmerStream and Khmer (abundance-dist-single.py and unique-kmers.py)) to assess their relative merits and demerits. METHODS: Principally, the miscounts/error-rates of these tools are analyzed by rigorous experimental analysis for a varied range of k. We also present experimental results on runtime, scalability for larger datasets, memory, CPU utilization as well as parallelism of k-mer frequency estimation methods. RESULTS: The results indicate that ntCard is more accurate in estimating F0, f1 and full k-mer abundance histograms compared with other methods. ntCard is the fastest but it has more memory requirements compared to KmerGenie. CONCLUSION: The results of this evaluation may serve as a roadmap to potential users and practitioners of streaming algorithms for estimating k-mer coverage frequencies, to assist them in identifying an appropriate method. Such an analysis of the results also helps researchers to discover remaining open research questions, effective combinations of existing techniques and possible avenues for future research. |
format | Online Article Text |
id | pubmed-6446480 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2019 |
publisher | Bentham Science Publishers |
record_format | MEDLINE/PubMed |
spelling | pubmed-6446480 2019-07-01 Estimating the k-mer Coverage Frequencies in Genomic Datasets: A Comparative Assessment of the State-of-the-art Manekar, Swati C. Sathe, Shailesh R. Curr Genomics Article BACKGROUND: In bioinformatics, estimation of k-mer abundance histograms or just enumerating the number of unique k-mers and the number of singletons is desirable in many genome sequence analysis applications. The applications include predicting genome sizes, data pre-processing for de Bruijn graph assembly methods (tune runtime parameters for analysis tools), repeat detection, sequencing coverage estimation, measuring sequencing error rates, etc. Different methods for cardinality estimation in sequencing data have been developed in recent years. OBJECTIVE: In this article, we present a comparative assessment of the different k-mer frequency estimation programs (ntCard, KmerGenie, KmerStream and Khmer (abundance-dist-single.py and unique-kmers.py)) to assess their relative merits and demerits. METHODS: Principally, the miscounts/error-rates of these tools are analyzed by rigorous experimental analysis for a varied range of k. We also present experimental results on runtime, scalability for larger datasets, memory, CPU utilization as well as parallelism of k-mer frequency estimation methods. RESULTS: The results indicate that ntCard is more accurate in estimating F0, f1 and full k-mer abundance histograms compared with other methods. ntCard is the fastest but it has more memory requirements compared to KmerGenie. CONCLUSION: The results of this evaluation may serve as a roadmap to potential users and practitioners of streaming algorithms for estimating k-mer coverage frequencies, to assist them in identifying an appropriate method. Such an analysis of the results also helps researchers to discover remaining open research questions, effective combinations of existing techniques and possible avenues for future research. Bentham Science Publishers 2019-01 2019-01 /pmc/articles/PMC6446480/ /pubmed/31015787 http://dx.doi.org/10.2174/1389202919666181026101326 Text en © 2019 Bentham Science Publishers https://creativecommons.org/licenses/by-nc/4.0/legalcode This is an open access article licensed under the terms of the Creative Commons Attribution-Non-Commercial 4.0 International Public License (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/legalcode), which permits unrestricted, non-commercial use, distribution and reproduction in any medium, provided the work is properly cited. |
spellingShingle | Article Manekar, Swati C. Sathe, Shailesh R. Estimating the k-mer Coverage Frequencies in Genomic Datasets: A Comparative Assessment of the State-of-the-art |
title | Estimating the k-mer Coverage Frequencies in Genomic Datasets: A Comparative Assessment of the State-of-the-art |
title_full | Estimating the k-mer Coverage Frequencies in Genomic Datasets: A Comparative Assessment of the State-of-the-art |
title_fullStr | Estimating the k-mer Coverage Frequencies in Genomic Datasets: A Comparative Assessment of the State-of-the-art |
title_full_unstemmed | Estimating the k-mer Coverage Frequencies in Genomic Datasets: A Comparative Assessment of the State-of-the-art |
title_short | Estimating the k-mer Coverage Frequencies in Genomic Datasets: A Comparative Assessment of the State-of-the-art |
title_sort | estimating the k-mer coverage frequencies in genomic datasets: a comparative assessment of the state-of-the-art |
topic | Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6446480/ https://www.ncbi.nlm.nih.gov/pubmed/31015787 http://dx.doi.org/10.2174/1389202919666181026101326 |
work_keys_str_mv | AT manekarswatic estimatingthekmercoveragefrequenciesingenomicdatasetsacomparativeassessmentofthestateoftheart AT satheshaileshr estimatingthekmercoveragefrequenciesingenomicdatasetsacomparativeassessmentofthestateoftheart |
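
Illustrative note (not part of the catalog record): the abstract refers to the k-mer abundance histogram, the number of unique k-mers and the number of singletons. The sketch below computes these quantities exactly on toy data so the terms are concrete. It assumes that F0 denotes the number of distinct k-mers and f1 the number of k-mers seen exactly once, which is how the abstract's wording reads. The reads, the value of k and the function names are invented for illustration. This is plain exact counting, not the streaming estimation used by ntCard, KmerGenie, KmerStream or Khmer, and it does not merge reverse-complement k-mers as real tools may.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Exactly count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        # A read of length n contains n - k + 1 overlapping k-mers.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def abundance_histogram(counts):
    """Map each multiplicity c to the number of distinct k-mers seen c times."""
    return Counter(counts.values())


# Toy reads, made up for illustration only.
reads = ["ACGTACGT", "CGTACGTT"]
counts = kmer_counts(reads, k=4)
hist = abundance_histogram(counts)

F0 = len(counts)  # assumed meaning: number of distinct k-mers
f1 = hist[1]      # assumed meaning: number of singleton k-mers

print("F0 =", F0)
print("f1 =", f1)
print("histogram =", dict(sorted(hist.items())))
```

Exact counting keeps every distinct k-mer in memory, which is the cost that the estimation programs compared in the article aim to avoid on large datasets.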