Genome-wide alignment-free phylogenetic distance estimation under a no strand-bias model

Summary: While alignment has been the dominant approach for determining homology prior to phylogenetic inference, alignment-free methods can simplify the analysis, especially when analyzing genome-wide data. Furthermore, alignment-free methods present the only option for emerging forms of data, s...

Bibliographic Details
Main Authors: Balaban, Metin; Bristy, Nishat Anjum; Faisal, Ahnaf; Bayzid, Md Shamsuzzoha; Mirarab, Siavash
Format: Online Article Text
Language: English
Published: Oxford University Press 2022
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9383262/
https://www.ncbi.nlm.nih.gov/pubmed/35992043
http://dx.doi.org/10.1093/bioadv/vbac055
author Balaban, Metin
Bristy, Nishat Anjum
Faisal, Ahnaf
Bayzid, Md Shamsuzzoha
Mirarab, Siavash
collection PubMed
description Summary: While alignment has been the dominant approach for determining homology prior to phylogenetic inference, alignment-free methods can simplify the analysis, especially when analyzing genome-wide data. Furthermore, alignment-free methods present the only option for emerging forms of data, such as genome skims, which do not permit assembly. Despite the appeal, alignment-free methods have not been competitive with alignment-based methods in terms of accuracy. One limitation of alignment-free methods is their reliance on simplified models of sequence evolution such as Jukes–Cantor. If we can estimate frequencies of base substitutions in an alignment-free setting, we can compute pairwise distances under more complex models. However, since the strand of DNA sequences is unknown for many forms of genome-wide data, which arguably present the best use case for alignment-free methods, the most complex models that one can use are the so-called no strand-bias models. We show how to calculate distances under a four-parameter no strand-bias model called TK4 without relying on alignments or assemblies. The main idea is to replace letters in the input sequences and recompute Jaccard indices between k-mer sets. However, on larger genomes, we also need to compute the number of k-mer mismatches after replacement due to random chance as opposed to homology. We show in simulation that alignment-free distances can be highly accurate when genomes evolve under the assumed models and study the accuracy on assembled and unassembled biological data. AVAILABILITY AND IMPLEMENTATION: Our software is available open source at https://github.com/nishatbristy007/NSB. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics Advances online.
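The letter-replacement idea in the summary can be illustrated with a minimal sketch. This is not the authors' NSB implementation: the function names, the purine/pyrimidine mapping, and the single-strand treatment (no reverse-complement handling and no correction for chance k-mer matches) are assumptions made purely for illustration.

```python
def kmer_set(seq: str, k: int) -> set[str]:
    """Return the set of all k-mers in seq (single strand only; an assumption)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard index |a & b| / |a | b|; defined as 1.0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def replace_letters(seq: str, mapping: dict[str, str]) -> str:
    """Recode a sequence by substituting letters according to mapping."""
    return seq.translate(str.maketrans(mapping))


# Hypothetical replacement scheme: merge purines (A, G) and pyrimidines (C, T).
PURINE_PYRIMIDINE = {"A": "R", "G": "R", "C": "Y", "T": "Y"}


def jaccard_before_and_after(s1: str, s2: str, k: int,
                             mapping: dict[str, str]) -> tuple[float, float]:
    """Jaccard index on the raw k-mer sets and again after letter replacement."""
    j_raw = jaccard(kmer_set(s1, k), kmer_set(s2, k))
    j_recoded = jaccard(kmer_set(replace_letters(s1, mapping), k),
                        kmer_set(replace_letters(s2, mapping), k))
    return j_raw, j_recoded


if __name__ == "__main__":
    # The two toy sequences differ by a single G -> A change at position 7.
    print(jaccard_before_and_after("ACGTACGT", "ACGTACAT", 3, PURINE_PYRIMIDINE))
```

In this toy case the raw Jaccard index is 4/6, while the recoded index is 1.0, because a purine-to-purine change disappears once A and G are merged. Comparing indices before and after such replacements is what would let one reason about separate substitution classes; how those quantities map onto TK4 parameters is described in the paper itself and is not reproduced here.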
format Online
Article
Text
id pubmed-9383262
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-9383262 2022-08-18 Bioinform Adv Original Paper Oxford University Press 2022-08-12 /pmc/articles/PMC9383262/ /pubmed/35992043 http://dx.doi.org/10.1093/bioadv/vbac055 Text en © The Author(s) 2022. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Genome-wide alignment-free phylogenetic distance estimation under a no strand-bias model
topic Original Paper