Evidence of Statistical Inconsistency of Phylogenetic Methods in the Presence of Multiple Sequence Alignment Uncertainty
Evolutionary studies usually use a two-step process to investigate sequence data. Step one estimates a multiple sequence alignment (MSA) and step two applies phylogenetic methods to ask evolutionary questions of that MSA. Modern phylogenetic methods infer evolutionary parameters using maximum likeli...
Main authors: | Md Mukarram Hossain, A.S.; Blackburne, Benjamin P.; Shah, Abhijeet; Whelan, Simon |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | Oxford University Press 2015 |
Subjects: | Research Article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4558847/ https://www.ncbi.nlm.nih.gov/pubmed/26139831 http://dx.doi.org/10.1093/gbe/evv127 |
_version_ | 1782388680798765056 |
---|---|
author | Md Mukarram Hossain, A.S.; Blackburne, Benjamin P.; Shah, Abhijeet; Whelan, Simon |
collection | PubMed |
description | Evolutionary studies usually use a two-step process to investigate sequence data. Step one estimates a multiple sequence alignment (MSA) and step two applies phylogenetic methods to ask evolutionary questions of that MSA. Modern phylogenetic methods infer evolutionary parameters using maximum likelihood or Bayesian inference, mediated by a probabilistic substitution model that describes sequence change over a tree. The statistical properties of these methods mean that more data directly translates to an increased confidence in downstream results, providing the substitution model is adequate and the MSA is correct. Many studies have investigated the robustness of phylogenetic methods in the presence of substitution model misspecification, but few have examined the statistical properties of those methods when the MSA is unknown. This simulation study examines the statistical properties of the complete two-step process when inferring sequence divergence and the phylogenetic tree topology. Both nucleotide and amino acid analyses are negatively affected by the alignment step, both through inaccurate guide tree estimates and through overfitting to that guide tree. For many alignment tools these effects become more pronounced when additional sequences are added to the analysis. Nucleotide sequences are particularly susceptible, with MSA errors leading to statistical support for long-branch attraction artifacts, which are usually associated with gross substitution model misspecification. Amino acid MSAs are more robust, but do tend to arbitrarily resolve multifurcations in favor of the guide tree. No inference strategies produce consistently accurate estimates of divergence between sequences, although amino acid MSAs are again more accurate than their nucleotide counterparts. We conclude with some practical suggestions about how to limit the effect of MSA uncertainty on evolutionary inference. |
format | Online Article Text |
id | pubmed-4558847 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2015 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-4558847 2015-09-08 Genome Biol Evol Research Article Oxford University Press 2015-07-01 /pmc/articles/PMC4558847/ /pubmed/26139831 http://dx.doi.org/10.1093/gbe/evv127 Text en © The Author(s) 2015. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
title | Evidence of Statistical Inconsistency of Phylogenetic Methods in the Presence of Multiple Sequence Alignment Uncertainty |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4558847/ https://www.ncbi.nlm.nih.gov/pubmed/26139831 http://dx.doi.org/10.1093/gbe/evv127 |