
Evidence of Statistical Inconsistency of Phylogenetic Methods in the Presence of Multiple Sequence Alignment Uncertainty


Bibliographic Details
Main Authors: Md Mukarram Hossain, A.S., Blackburne, Benjamin P., Shah, Abhijeet, Whelan, Simon
Format: Online Article Text
Language: English
Published: Oxford University Press 2015
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4558847/
https://www.ncbi.nlm.nih.gov/pubmed/26139831
http://dx.doi.org/10.1093/gbe/evv127
author Md Mukarram Hossain, A.S.
Blackburne, Benjamin P.
Shah, Abhijeet
Whelan, Simon
collection PubMed
description Evolutionary studies usually use a two-step process to investigate sequence data. Step one estimates a multiple sequence alignment (MSA) and step two applies phylogenetic methods to ask evolutionary questions of that MSA. Modern phylogenetic methods infer evolutionary parameters using maximum likelihood or Bayesian inference, mediated by a probabilistic substitution model that describes sequence change over a tree. The statistical properties of these methods mean that more data directly translates to an increased confidence in downstream results, providing the substitution model is adequate and the MSA is correct. Many studies have investigated the robustness of phylogenetic methods in the presence of substitution model misspecification, but few have examined the statistical properties of those methods when the MSA is unknown. This simulation study examines the statistical properties of the complete two-step process when inferring sequence divergence and the phylogenetic tree topology. Both nucleotide and amino acid analyses are negatively affected by the alignment step, both through inaccurate guide tree estimates and through overfitting to that guide tree. For many alignment tools these effects become more pronounced when additional sequences are added to the analysis. Nucleotide sequences are particularly susceptible, with MSA errors leading to statistical support for long-branch attraction artifacts, which are usually associated with gross substitution model misspecification. Amino acid MSAs are more robust, but do tend to arbitrarily resolve multifurcations in favor of the guide tree. No inference strategies produce consistently accurate estimates of divergence between sequences, although amino acid MSAs are again more accurate than their nucleotide counterparts. We conclude with some practical suggestions about how to limit the effect of MSA uncertainty on evolutionary inference.
format Online
Article
Text
id pubmed-4558847
institution National Center for Biotechnology Information
language English
publishDate 2015
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-4558847 2015-09-08 Genome Biol Evol Research Article Oxford University Press 2015-07-01 /pmc/articles/PMC4558847/ /pubmed/26139831 http://dx.doi.org/10.1093/gbe/evv127 Text en © The Author(s) 2015. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. http://creativecommons.org/licenses/by/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
title Evidence of Statistical Inconsistency of Phylogenetic Methods in the Presence of Multiple Sequence Alignment Uncertainty
topic Research Article