Cargando…

Genome-wide alignment-free phylogenetic distance estimation under a no strand-bias model

 : Summary: While alignment has been the dominant approach for determining homology prior to phylogenetic inference, alignment-free methods can simplify the analysis, especially when analyzing genome-wide data. Furthermore, alignment-free methods present the only option for emerging forms of data, s...

Descripción completa

Detalles Bibliográficos
Autores principales: Balaban, Metin, Bristy, Nishat Anjum, Faisal, Ahnaf, Bayzid, Md Shamsuzzoha, Mirarab, Siavash
Formato: Online Artículo Texto
Lenguaje:English
Publicado: Oxford University Press 2022
Materias:
Acceso en línea:https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9383262/
https://www.ncbi.nlm.nih.gov/pubmed/35992043
http://dx.doi.org/10.1093/bioadv/vbac055
Descripción
Sumario: : Summary: While alignment has been the dominant approach for determining homology prior to phylogenetic inference, alignment-free methods can simplify the analysis, especially when analyzing genome-wide data. Furthermore, alignment-free methods present the only option for emerging forms of data, such as genome skims, which do not permit assembly. Despite the appeal, alignment-free methods have not been competitive with alignment-based methods in terms of accuracy. One limitation of alignment-free methods is their reliance on simplified models of sequence evolution such as Jukes–Cantor. If we can estimate frequencies of base substitutions in an alignment-free setting, we can compute pairwise distances under more complex models. However, since the strand of DNA sequences is unknown for many forms of genome-wide data, which arguably present the best use case for alignment-free methods, the most complex models that one can use are the so-called no strand-bias models. We show how to calculate distances under a four-parameter no strand-bias model called TK4 without relying on alignments or assemblies. The main idea is to replace letters in the input sequences and recompute Jaccard indices between k-mer sets. However, on larger genomes, we also need to compute the number of k-mer mismatches after replacement due to random chance as opposed to homology. We show in simulation that alignment-free distances can be highly accurate when genomes evolve under the assumed models and study the accuracy on assembled and unassembled biological data. AVAILABILITY AND IMPLEMENTATION: Our software is available open source at https://github.com/nishatbristy007/NSB. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics Advances online.