
TopHap: rapid inference of key phylogenetic structures from common haplotypes in large genome collections with limited diversity

MOTIVATION: Building reliable phylogenies from very large collections of sequences with a limited number of phylogenetically informative sites is challenging because sequencing errors and recurrent/backward mutations interfere with the phylogenetic signal, confounding true evolutionary relationships...


Bibliographic Details
Main authors: Caraballo-Ortiz, Marcos A, Miura, Sayaka, Sanderford, Maxwell, Dolker, Tenzin, Tao, Qiqing, Weaver, Steven, Pond, Sergei L K, Kumar, Sudhir
Format: Online Article Text
Language: English
Published: Oxford University Press 2022
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9113349/
https://www.ncbi.nlm.nih.gov/pubmed/35561179
http://dx.doi.org/10.1093/bioinformatics/btac186
_version_ 1784709567130632192
author Caraballo-Ortiz, Marcos A
Miura, Sayaka
Sanderford, Maxwell
Dolker, Tenzin
Tao, Qiqing
Weaver, Steven
Pond, Sergei L K
Kumar, Sudhir
author_sort Caraballo-Ortiz, Marcos A
collection PubMed
description MOTIVATION: Building reliable phylogenies from very large collections of sequences with a limited number of phylogenetically informative sites is challenging because sequencing errors and recurrent/backward mutations interfere with the phylogenetic signal, confounding true evolutionary relationships. Massive global efforts of sequencing genomes and reconstructing the phylogeny of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) strains exemplify these difficulties since there are only hundreds of phylogenetically informative sites but millions of genomes. For such datasets, we set out to develop a method for building the phylogenetic tree of genomic haplotypes consisting of positions harboring common variants to improve the signal-to-noise ratio for more accurate and fast phylogenetic inference of resolvable phylogenetic features. RESULTS: We present the TopHap approach that determines spatiotemporally common haplotypes of common variants and builds their phylogeny at a fraction of the computational time of traditional methods. We develop a bootstrap strategy that resamples genomes spatiotemporally to assess topological robustness. The application of TopHap to build a phylogeny of 68 057 SARS-CoV-2 genomes (68KG) from the first year of the pandemic produced an evolutionary tree of major SARS-CoV-2 haplotypes. This phylogeny is concordant with the mutation tree inferred using the co-occurrence pattern of mutations and recovers key phylogenetic relationships from more traditional analyses. We also evaluated alternative roots of the SARS-CoV-2 phylogeny and found that the earliest sampled genomes in 2019 likely evolved by four mutations of the most recent common ancestor of all SARS-CoV-2 genomes. 
An application of TopHap to more than 1 million SARS-CoV-2 genomes reconstructed the most comprehensive evolutionary relationships of major variants, which confirmed the 68KG phylogeny and provided evolutionary origins of major and recent variants of concern. AVAILABILITY AND IMPLEMENTATION: TopHap is available at https://github.com/SayakaMiura/TopHap. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
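The description above says that haplotypes are built only from positions harboring common variants, and that only common haplotypes are retained. A minimal sketch of that filtering idea follows. It is not the TopHap implementation: the function names, the toy alignment, and the frequency thresholds are all hypothetical, and the sketch ignores the spatiotemporal aspect of the method as well as the tree-building and bootstrap steps.

```python
from collections import Counter


def common_variant_positions(genomes, reference, min_freq):
    """Return positions where a single non-reference base reaches min_freq.

    Hypothetical helper: assumes all genomes are aligned to `reference`
    and have the same length.
    """
    n = len(genomes)
    positions = []
    for i, ref_base in enumerate(reference):
        # Count non-reference bases observed at this position.
        counts = Counter(g[i] for g in genomes if g[i] != ref_base)
        if counts and max(counts.values()) / n >= min_freq:
            positions.append(i)
    return positions


def common_haplotypes(genomes, positions, min_freq):
    """Collapse genomes to haplotypes over `positions`; keep the common ones."""
    n = len(genomes)
    counts = Counter("".join(g[i] for i in positions) for g in genomes)
    return {hap: c for hap, c in counts.items() if c / n >= min_freq}


# Toy alignment (invented data, for illustration only).
reference = "ACGTACGT"
genomes = [
    "ACGTACGT",
    "ACGTACGT",
    "ATGTACGT",
    "ATGTACGT",
    "ATGTACCT",
    "ATGTACCT",
    "ACGTACGA",  # singleton variant at the last site, filtered out as rare
    "ACGAACGT",  # singleton variant at position 3, filtered out as rare
]

positions = common_variant_positions(genomes, reference, min_freq=0.2)
haplotypes = common_haplotypes(genomes, positions, min_freq=0.2)
print(positions)   # [1, 6]
print(haplotypes)  # {'CG': 4, 'TG': 2, 'TC': 2}
```

In this toy example the two singleton variants are dropped, so the genomes that carry them collapse into the reference-like haplotype. This is the sense in which restricting to common variants can improve the signal-to-noise ratio.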
format Online
Article
Text
id pubmed-9113349
institution National Center for Biotechnology Information
language English
publishDate 2022
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-9113349 2022-05-18 Bioinformatics Original Papers Oxford University Press 2022-03-24 /pmc/articles/PMC9113349/ /pubmed/35561179 http://dx.doi.org/10.1093/bioinformatics/btac186 Text en © The Author(s) 2022. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (https://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title TopHap: rapid inference of key phylogenetic structures from common haplotypes in large genome collections with limited diversity
topic Original Papers
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9113349/
https://www.ncbi.nlm.nih.gov/pubmed/35561179
http://dx.doi.org/10.1093/bioinformatics/btac186