
A LASSO-based approach to sample sites for phylogenetic tree search

MOTIVATION: In recent years, full-genome sequences have become increasingly available and as a result many modern phylogenetic analyses are based on very long sequences, often with over 100 000 sites. Phylogenetic reconstructions of large-scale alignments are challenging for likelihood-based phyloge...


Bibliographic Details
Main authors: Ecker, Noa; Azouri, Dana; Bettisworth, Ben; Stamatakis, Alexandros; Mansour, Yishay; Mayrose, Itay; Pupko, Tal
Format: Online Article Text
Language: English
Published: Oxford University Press 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9236582/
https://www.ncbi.nlm.nih.gov/pubmed/35758778
http://dx.doi.org/10.1093/bioinformatics/btac252
collection PubMed
description MOTIVATION: In recent years, full-genome sequences have become increasingly available, and as a result many modern phylogenetic analyses are based on very long sequences, often with over 100 000 sites. Phylogenetic reconstructions of large-scale alignments are challenging for likelihood-based phylogenetic inference programs and usually require a powerful computer cluster. Current tools for alignment trimming prior to phylogenetic analysis do not promise a significant reduction in the alignment size and are claimed to have a negative effect on the accuracy of the obtained tree.
RESULTS: Here, we propose an artificial-intelligence-based approach, which provides a means to select the optimal subset of sites and a formula by which one can compute the log-likelihood of the entire data based on this subset. Our approach is based on training a regularized Lasso-regression model that optimizes the log-likelihood prediction accuracy while putting a constraint on the number of sites used for the approximation. We show that computing the likelihood based on 5% of the sites already provides an accurate approximation of the tree likelihood based on the entire data. Furthermore, we show that using this Lasso-based approximation during a tree search substantially decreased running time while retaining the same tree-search performance.
AVAILABILITY AND IMPLEMENTATION: The code was implemented in Python version 3.8 and is available through GitHub (https://github.com/noaeker/lasso_positions_sampling). The datasets used in this paper were retrieved from Zhou et al. (2018) as described in Section 3.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
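The idea summarized in the RESULTS paragraph can be illustrated with a small sketch. It is not the authors' implementation (that lives in the GitHub repository cited above). It assumes that each training example is one tree, that the features are that tree's per-site log-likelihoods, and that the target is the full-alignment log-likelihood. The data are synthetic, and every name here (simulate_site_loglik, fit_lasso, alpha_max, the "quality" factor) is hypothetical. The Lasso is solved with a plain coordinate-descent loop so that the block needs only the standard library.

```python
import random

def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; this is what zeroes out sites."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def center(X, y):
    """Subtract column means from X and the mean from y (handles the intercept)."""
    m, n = len(X), len(X[0])
    col_means = [sum(row[j] for row in X) / m for j in range(n)]
    y_mean = sum(y) / m
    Xc = [[row[j] - col_means[j] for j in range(n)] for row in X]
    yc = [v - y_mean for v in y]
    return Xc, yc, col_means, y_mean

def alpha_max(X, y):
    """Largest useful penalty: at or above it, every site weight is zero."""
    Xc, yc, _, _ = center(X, y)
    m, n = len(Xc), len(Xc[0])
    return max(abs(sum(Xc[i][j] * yc[i] for i in range(m))) / m for j in range(n))

def fit_lasso(X, y, alpha, sweeps=200):
    """Minimise (1/2m)*||y - b - Xw||^2 + alpha*||w||_1 by coordinate descent."""
    Xc, yc, col_means, y_mean = center(X, y)
    m, n = len(Xc), len(Xc[0])
    w = [0.0] * n
    residual = yc[:]  # residual of the centered problem, starts at yc since w = 0
    for _ in range(sweeps):
        for j in range(n):
            col = [Xc[i][j] for i in range(m)]
            z = sum(c * c for c in col) / m
            if z == 0.0:
                continue  # constant site carries no information
            # Correlation of site j with the residual that ignores site j.
            rho = sum(c * (r + w[j] * c) for c, r in zip(col, residual)) / m
            new_w = soft_threshold(rho, alpha) / z
            delta = new_w - w[j]
            if delta != 0.0:
                residual = [r - delta * c for r, c in zip(residual, col)]
                w[j] = new_w
    intercept = y_mean - sum(wj * mu for wj, mu in zip(w, col_means))
    return w, intercept

def objective(X, y, w, b, alpha):
    """Value of the penalised least-squares objective being minimised."""
    sq = sum((yi - b - sum(wj * xj for wj, xj in zip(w, row))) ** 2
             for row, yi in zip(X, y))
    return sq / (2 * len(X)) + alpha * sum(abs(wj) for wj in w)

def simulate_site_loglik(num_trees=40, num_sites=30, seed=0):
    """Synthetic stand-in for per-site log-likelihoods of a set of trees."""
    rng = random.Random(seed)
    base = [rng.uniform(2.0, 8.0) for _ in range(num_sites)]
    load = [rng.uniform(0.5, 1.5) for _ in range(num_sites)]
    X = []
    for _ in range(num_trees):
        quality = rng.gauss(0.0, 1.0)  # hypothetical shared "tree quality" factor
        X.append([load[j] * quality - base[j] + rng.gauss(0.0, 0.05)
                  for j in range(num_sites)])
    y = [sum(row) for row in X]  # full log-likelihood = sum over all sites
    return X, y

X, y = simulate_site_loglik()
alpha = 0.1 * alpha_max(X, y)  # assumed penalty; larger values keep fewer sites
weights, intercept = fit_lasso(X, y, alpha)
selected = [j for j, wj in enumerate(weights) if wj != 0.0]

def approx_loglik(row):
    """Approximate the full log-likelihood from the selected sites only."""
    return intercept + sum(weights[j] * row[j] for j in selected)

worst = max(abs(approx_loglik(row) - yi) for row, yi in zip(X, y))
print(f"selected {len(selected)} of {len(weights)} sites; "
      f"largest training error {worst:.3f}")
```

The penalty alpha plays the role of the constraint on the number of sites: raising it toward alpha_max(X, y) drives more weights to exactly zero, and the fitted intercept and weights are the "formula" that maps the subset's site log-likelihoods back to an estimate of the full log-likelihood. In a real setting the rows would come from a likelihood program evaluated on actual trees rather than from the simulator used here.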
id pubmed-9236582
institution National Center for Biotechnology Information
record_format MEDLINE/PubMed
spelling pubmed-9236582 2022-06-29 A LASSO-based approach to sample sites for phylogenetic tree search. Bioinformatics, ISCB/Ismb 2022.
Oxford University Press 2022-06-27 /pmc/articles/PMC9236582/ /pubmed/35758778 http://dx.doi.org/10.1093/bioinformatics/btac252 Text en © The Author(s) 2022. Published by Oxford University Press. https://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
topic ISCB/Ismb 2022