Maximum likelihood pandemic-scale phylogenetics

Phylogenetics plays a crucial role in the interpretation of genomic data(1). Phylogenetic analyses of SARS-CoV-2 genomes have allowed the detailed study of the virus’s origins(2), of its international(3,4) and local(4–9) spread, and of the emergence(10) and reproductive success(11) of new variants,...

Bibliographic Details
Main authors: De Maio, Nicola, Kalaghatgi, Prabhav, Turakhia, Yatish, Corbett-Detig, Russell, Minh, Bui Quang, Goldman, Nick
Format: Online Article Text
Language: English
Published: Cold Spring Harbor Laboratory 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8963701/
https://www.ncbi.nlm.nih.gov/pubmed/35350209
http://dx.doi.org/10.1101/2022.03.22.485312
collection PubMed
description Phylogenetics plays a crucial role in the interpretation of genomic data(1). Phylogenetic analyses of SARS-CoV-2 genomes have allowed the detailed study of the virus’s origins(2), of its international(3,4) and local(4–9) spread, and of the emergence(10) and reproductive success(11) of new variants, among many applications. These analyses have been enabled by the unparalleled volumes of genome sequence data generated and employed to study and help contain the pandemic(12). However, preferred model-based phylogenetic approaches including maximum likelihood and Bayesian methods, mostly based on Felsenstein’s ‘pruning’ algorithm(13,14), cannot scale to the size of the datasets from the current pandemic(4,15), hampering our understanding of the virus’s evolution and transmission(16). We present new approaches, based on reworking Felsenstein’s algorithm, for likelihood-based phylogenetic analysis of epidemiological genomic datasets at unprecedented scales. We exploit near-certainty regarding ancestral genomes, and the similarities between closely related and densely sampled genomes, to greatly reduce computational demands for memory and time. Combined with new methods for searching amongst candidate evolutionary trees, this results in our MAPLE (‘MAximum Parsimonious Likelihood Estimation’) software giving better results than popular approaches such as FastTree 2(17), IQ-TREE 2(18), RAxML-NG(19) and UShER(15). Our approach therefore allows complex and accurate probabilistic phylogenetic analyses of millions of microbial genomes, extending the reach of genomic epidemiology. Future epidemiological datasets are likely to be even larger than those currently associated with COVID-19, and other disciplines such as metagenomics and biodiversity science are also generating huge numbers of genome sequences(20–22). Our methods will permit continued use of preferred likelihood-based phylogenetic analyses.
id pubmed-8963701
institution National Center for Biotechnology Information
record_format MEDLINE/PubMed
spelling pubmed-8963701 2022-12-15 bioRxiv Article Cold Spring Harbor Laboratory 2022-07-18 /pmc/articles/PMC8963701/ /pubmed/35350209 http://dx.doi.org/10.1101/2022.03.22.485312 Text en https://creativecommons.org/licenses/by-nc-nd/4.0/ This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (https://creativecommons.org/licenses/by-nc-nd/4.0/), which allows reusers to copy and distribute the material in any medium or format in unadapted form only, for noncommercial purposes only, and only so long as attribution is given to the creator.