
PaSiT: a novel approach based on short-oligonucleotide frequencies for efficient bacterial identification and typing


Bibliographic Details
Main Authors: Goussarov, Gleb; Cleenwerck, Ilse; Mysara, Mohamed; Leys, Natalie; Monsieurs, Pieter; Tahon, Guillaume; Carlier, Aurélien; Vandamme, Peter; Van Houdt, Rob
Format: Online Article Text
Language: English
Published: Oxford University Press 2020
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7178395/
https://www.ncbi.nlm.nih.gov/pubmed/31899493
http://dx.doi.org/10.1093/bioinformatics/btz964
Description
Summary: MOTIVATION: One of the most widespread methods used in taxonomy studies to distinguish between strains or taxa is the calculation of average nucleotide identity. It requires a computationally expensive alignment step and is therefore not suitable for large-scale comparisons. Short oligonucleotide-based methods offer a faster alternative, but at the expense of accuracy. Here, we aim to address this shortcoming by providing software that implements a novel method based on short-oligonucleotide frequencies to compute inter-genomic distances. RESULTS: Our tetranucleotide and hexanucleotide implementations, which were optimized based on a taxonomically well-defined set of over 200 newly sequenced bacterial genomes, are as accurate as the short oligonucleotide-based method TETRA and average nucleotide identity for identifying bacterial species and strains, respectively. Moreover, the lightweight nature of this method makes it applicable to large-scale analyses. AVAILABILITY AND IMPLEMENTATION: The method introduced here was implemented, together with other existing methods, in GenDisCal, a dependency-free software tool written in C and available as source code from https://github.com/LM-UGent/GenDisCal. The software supports multithreading and has been tested on Windows and Linux (CentOS). In addition, a Java-based graphical user interface that acts as a wrapper for the software is also available. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
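
Illustrative note: the summary refers to computing inter-genomic distances from short-oligonucleotide frequencies. The Python sketch below shows only the general idea: it counts tetranucleotides in two sequences and compares the resulting frequency vectors. It is not the PaSiT method and not GenDisCal's implementation. The function names `kmer_frequencies` and `euclidean_distance` are hypothetical, and plain Euclidean distance is an assumption chosen for simplicity, not the measure used in the article.

```python
from collections import Counter
from itertools import product
import math


def kmer_frequencies(seq, k=4):
    """Relative frequencies of all 4**k k-mers over A/C/G/T, in a fixed order.

    Hypothetical helper for illustration only. Windows containing any
    other character (for example N) are ignored.
    """
    seq = seq.upper()
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Fixed lexicographic ordering so vectors from different genomes align.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[km] for km in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[km] / total for km in kmers]


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length frequency vectors.

    An assumed stand-in metric; the article defines its own distance.
    """
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


if __name__ == "__main__":
    genome_a = "ACGTACGTTTGACCAGTACGATCGATCGTTAGC"
    genome_b = "ACGTTCGATTGACGAGTACGATCCATCGTAAGC"
    fa = kmer_frequencies(genome_a)
    fb = kmer_frequencies(genome_b)
    print(len(fa), euclidean_distance(fa, fb))
```

Because this approach needs no alignment, its cost grows linearly with genome length, which is the speed advantage the summary attributes to oligonucleotide-based methods in general.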