
Estimating the k-mer Coverage Frequencies in Genomic Datasets: A Comparative Assessment of the State-of-the-art

BACKGROUND: In bioinformatics, estimating k-mer abundance histograms, or just enumerating the number of unique k-mers and the number of singletons, is desirable in many genome sequence analysis applications. The applications include predicting genome sizes, data pre-processing for de Bruijn graph...


Bibliographic Details
Main Authors:	Manekar, Swati C., Sathe, Shailesh R.
Format:	Online Article Text
Language:	English
Published:	Bentham Science Publishers 2019
Subjects:	Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6446480/ https://www.ncbi.nlm.nih.gov/pubmed/31015787 http://dx.doi.org/10.2174/1389202919666181026101326

_version_	1783408371383140352
author	Manekar, Swati C. Sathe, Shailesh R.
author_facet	Manekar, Swati C. Sathe, Shailesh R.
author_sort	Manekar, Swati C.
collection	PubMed
description	BACKGROUND: In bioinformatics, estimating k-mer abundance histograms, or just enumerating the number of unique k-mers and the number of singletons, is desirable in many genome sequence analysis applications. The applications include predicting genome sizes, data pre-processing for de Bruijn graph assembly methods (tuning runtime parameters for analysis tools), repeat detection, sequencing coverage estimation, measuring sequencing error rates, etc. Different methods for cardinality estimation in sequencing data have been developed in recent years. OBJECTIVE: In this article, we present a comparative assessment of the different k-mer frequency estimation programs (ntCard, KmerGenie, KmerStream and Khmer (abundance-dist-single.py and unique-kmers.py)) to assess their relative merits and demerits. METHODS: Principally, the miscounts/error rates of these tools are analyzed by rigorous experimental analysis for a varied range of k. We also present experimental results on runtime, scalability for larger datasets, memory, CPU utilization, and parallelism of k-mer frequency estimation methods. RESULTS: The results indicate that ntCard is more accurate in estimating F0, f1 and full k-mer abundance histograms compared with other methods. ntCard is the fastest, but it requires more memory than KmerGenie. CONCLUSION: The results of this evaluation may serve as a roadmap for potential users and practitioners of streaming algorithms for estimating k-mer coverage frequencies, to assist them in identifying an appropriate method. Such analysis of the results also helps researchers to discover remaining open research questions, effective combinations of existing techniques and possible avenues for future research.
format	Online Article Text
id	pubmed-6446480
institution	National Center for Biotechnology Information
language	English
publishDate	2019
publisher	Bentham Science Publishers
record_format	MEDLINE/PubMed
spelling	pubmed-6446480 2019-07-01 Estimating the k-mer Coverage Frequencies in Genomic Datasets: A Comparative Assessment of the State-of-the-art Manekar, Swati C. Sathe, Shailesh R. Curr Genomics Article BACKGROUND: In bioinformatics, estimating k-mer abundance histograms, or just enumerating the number of unique k-mers and the number of singletons, is desirable in many genome sequence analysis applications. The applications include predicting genome sizes, data pre-processing for de Bruijn graph assembly methods (tuning runtime parameters for analysis tools), repeat detection, sequencing coverage estimation, measuring sequencing error rates, etc. Different methods for cardinality estimation in sequencing data have been developed in recent years. OBJECTIVE: In this article, we present a comparative assessment of the different k-mer frequency estimation programs (ntCard, KmerGenie, KmerStream and Khmer (abundance-dist-single.py and unique-kmers.py)) to assess their relative merits and demerits. METHODS: Principally, the miscounts/error rates of these tools are analyzed by rigorous experimental analysis for a varied range of k. We also present experimental results on runtime, scalability for larger datasets, memory, CPU utilization, and parallelism of k-mer frequency estimation methods. RESULTS: The results indicate that ntCard is more accurate in estimating F0, f1 and full k-mer abundance histograms compared with other methods. ntCard is the fastest, but it requires more memory than KmerGenie. CONCLUSION: The results of this evaluation may serve as a roadmap for potential users and practitioners of streaming algorithms for estimating k-mer coverage frequencies, to assist them in identifying an appropriate method. Such analysis of the results also helps researchers to discover remaining open research questions, effective combinations of existing techniques and possible avenues for future research. Bentham Science Publishers 2019-01 2019-01 /pmc/articles/PMC6446480/ /pubmed/31015787 http://dx.doi.org/10.2174/1389202919666181026101326 Text en © 2019 Bentham Science Publishers https://creativecommons.org/licenses/by-nc/4.0/legalcode This is an open access article licensed under the terms of the Creative Commons Attribution-Non-Commercial 4.0 International Public License (CC BY-NC 4.0) (https://creativecommons.org/licenses/by-nc/4.0/legalcode), which permits unrestricted, non-commercial use, distribution and reproduction in any medium, provided the work is properly cited.
spellingShingle	Article Manekar, Swati C. Sathe, Shailesh R. Estimating the k-mer Coverage Frequencies in Genomic Datasets: A Comparative Assessment of the State-of-the-art
title	Estimating the k-mer Coverage Frequencies in Genomic Datasets: A Comparative Assessment of the State-of-the-art
title_full	Estimating the k-mer Coverage Frequencies in Genomic Datasets: A Comparative Assessment of the State-of-the-art
title_fullStr	Estimating the k-mer Coverage Frequencies in Genomic Datasets: A Comparative Assessment of the State-of-the-art
title_full_unstemmed	Estimating the k-mer Coverage Frequencies in Genomic Datasets: A Comparative Assessment of the State-of-the-art
title_short	Estimating the k-mer Coverage Frequencies in Genomic Datasets: A Comparative Assessment of the State-of-the-art
title_sort	estimating the k-mer coverage frequencies in genomic datasets: a comparative assessment of the state-of-the-art
topic	Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6446480/ https://www.ncbi.nlm.nih.gov/pubmed/31015787 http://dx.doi.org/10.2174/1389202919666181026101326
work_keys_str_mv	AT manekarswatic estimatingthekmercoveragefrequenciesingenomicdatasetsacomparativeassessmentofthestateoftheart AT satheshaileshr estimatingthekmercoveragefrequenciesingenomicdatasetsacomparativeassessmentofthestateoftheart


Similar Items