Efficient inference, potential, and limitations of site-specific substitution models

Natural selection imposes a complex filter on which variants persist in a population, resulting in evolutionary patterns that vary greatly along the genome. Some sites evolve close to neutrally, while others are highly conserved, allow only specific states, or only change in concert with other sites....

Bibliographic Details
Main Authors: Puller, Vadim; Sagulenko, Pavel; Neher, Richard A
Format: Online Article Text
Language: English
Published: Oxford University Press 2020
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7733610/
https://www.ncbi.nlm.nih.gov/pubmed/33343922
http://dx.doi.org/10.1093/ve/veaa066
author Puller, Vadim
Sagulenko, Pavel
Neher, Richard A
collection PubMed
description Natural selection imposes a complex filter on which variants persist in a population, resulting in evolutionary patterns that vary greatly along the genome. Some sites evolve close to neutrally, while others are highly conserved, allow only specific states, or only change in concert with other sites. On the one hand, such constraints on sequence evolution can be used to infer biological function; on the other hand, they need to be accounted for in phylogenetic reconstruction. Phylogenetic models often account for this complexity by partitioning sites into a small number of discrete classes with different rates and/or state preferences. Appropriate model complexity is typically determined by model selection procedures. Here, we present an efficient algorithm to estimate more complex models that allow for different preferences at every site and explore the accuracy with which such models can be estimated from simulated data. Our iterative approximate maximum likelihood scheme uses information in the data efficiently and accurately estimates site-specific preferences from large data sets with moderately diverged sequences and known topology. However, the joint estimation of site-specific rates, site-specific preferences, and phylogenetic branch lengths can suffer from identifiability problems, while ignoring variation in preferences across sites results in branch length underestimates. Site-specific preferences estimated from large HIV pol alignments show qualitative concordance with intra-host estimates of fitness costs. Analysis of these substitution models suggests near saturation of divergence after a few hundred years. Such saturation can explain the inability to infer deep divergence times of HIV and SIVs using molecular clock approaches and time-dependent rate estimates.
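The link the abstract draws between site-specific preferences, saturation, and underestimated branch lengths can be illustrated with a small sketch. It is not the authors' algorithm: it assumes a simplified F81-like model in which a site mutates at a given rate and every new state is drawn from that site's preference vector. The function name, the rate of 1.0, and both preference vectors are hypothetical values chosen for illustration.

```python
import math


def expected_difference(prefs, rate, t):
    """Probability that two sequences separated by time t differ at one site.

    Assumes (for illustration only) an F81-like model: resampling events
    hit the site at `rate`, each event draws a new state from the site's
    preference vector `prefs`, and the ancestral state is drawn from
    `prefs` as well.
    """
    if abs(sum(prefs) - 1.0) > 1e-9:
        raise ValueError("preferences must sum to 1")
    # Chance that two independent draws from prefs give the same state.
    same_state = sum(p * p for p in prefs)
    # Chance that at least one resampling event occurred within time t.
    hit = 1.0 - math.exp(-rate * t)
    return (1.0 - same_state) * hit


# Hypothetical preference vectors over the states A, C, G, T.
uniform = [0.25, 0.25, 0.25, 0.25]
constrained = [0.85, 0.05, 0.05, 0.05]

for t in (0.1, 1.0, 10.0):
    print(t,
          expected_difference(uniform, 1.0, t),
          expected_difference(constrained, 1.0, t))
```

Under these assumptions the unconstrained site approaches a difference probability of 0.75 at long times, while the constrained site levels off near 0.27. A model that assumes uniform preferences would read the lower plateau as a short branch, which is one way to picture the branch length underestimates mentioned in the abstract.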
format Online
Article
Text
id pubmed-7733610
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-7733610 2020-12-17 Efficient inference, potential, and limitations of site-specific substitution models Puller, Vadim Sagulenko, Pavel Neher, Richard A Virus Evol Research Article
Oxford University Press 2020-08-20 /pmc/articles/PMC7733610/ /pubmed/33343922 http://dx.doi.org/10.1093/ve/veaa066 Text en © The Author(s) 2020. Published by Oxford University Press. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Efficient inference, potential, and limitations of site-specific substitution models
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7733610/
https://www.ncbi.nlm.nih.gov/pubmed/33343922
http://dx.doi.org/10.1093/ve/veaa066