KAnalyze: a fast versatile pipelined K-mer toolkit

Bibliographic Details
Main Authors: Audano, Peter; Vannberg, Fredrik
Format: Online Article Text
Language: English
Published: Oxford University Press 2014
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4080738/
https://www.ncbi.nlm.nih.gov/pubmed/24642064
http://dx.doi.org/10.1093/bioinformatics/btu152
Description
Summary:
Motivation: Converting nucleotide sequences into short overlapping fragments of uniform length, k-mers, is a common step in many bioinformatics applications. While existing software packages count k-mers, few are optimized for speed, offer an application programming interface (API) or a graphical interface, or contain features that make them extensible and maintainable. We designed KAnalyze to compete with the fastest k-mer counters, to produce reliable output and to support future development efforts through well-architected, documented and testable code. Currently, KAnalyze can output k-mer counts in a sorted tab-delimited file or stream k-mers as they are read. KAnalyze can process large datasets with 2 GB of memory. This project is implemented in Java 7, and the command line interface (CLI) is designed to integrate into pipelines written in any language.
Results: As a k-mer counter, KAnalyze outperforms Jellyfish, DSK and a pipeline built on Perl and Linux utilities. Through extensive unit and system testing, we have verified that KAnalyze produces the correct k-mer counts over multiple datasets and k-mer sizes.
Availability and implementation: KAnalyze is available on SourceForge: https://sourceforge.net/projects/kanalyze/
Contact: fredrik.vannberg@biology.gatech.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
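The k-mer decomposition and the sorted tab-delimited count output described in the summary can be illustrated with a minimal Python sketch. This is not KAnalyze's implementation (which is written in Java) and does not reflect its API or its memory-bounded design; the function names `count_kmers` and `format_counts` are hypothetical, and skipping windows that contain characters other than A, C, G and T is an assumption made for this example only.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping k-mer (window of length k) in a sequence.

    Hypothetical helper for illustration; not part of KAnalyze.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    seq = sequence.upper()
    counts = Counter()
    # Slide a window of length k one base at a time.
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        # Assumption: windows with ambiguous bases (e.g. N) are skipped.
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


def format_counts(counts):
    """Render counts as tab-delimited lines sorted by k-mer."""
    return "\n".join(f"{kmer}\t{n}" for kmer, n in sorted(counts.items()))


if __name__ == "__main__":
    print(format_counts(count_kmers("ACGTACGT", 3)))
```

For the sequence ACGTACGT and k = 3, the six overlapping windows are ACG, CGT, GTA, TAC, ACG and CGT, so ACG and CGT are each counted twice. This sketch holds all counts in memory, which would not scale to the large datasets the summary mentions.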