Identifying Clusters of High Confidence Homologies in Multiple Sequence Alignments

Bibliographic Details
Main Authors: Ali, Raja Hashim, Bogusz, Marcin, Whelan, Simon
Format: Online Article Text
Language: English
Published: Oxford University Press 2019
Subjects: Methods
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6933875/
https://www.ncbi.nlm.nih.gov/pubmed/31209473
http://dx.doi.org/10.1093/molbev/msz142
author Ali, Raja Hashim
Bogusz, Marcin
Whelan, Simon
collection PubMed
description Multiple sequence alignment (MSA) is ubiquitous in evolution and bioinformatics. MSAs are usually taken to be a known and fixed quantity on which to perform downstream analysis, despite extensive evidence that MSA accuracy and uncertainty affect results. MSA errors are known to cause a wide range of problems for downstream evolutionary inference, ranging from false inference of positive selection to long branch attraction artifacts. The most popular approach to dealing with this problem is to remove (filter) specific columns in the MSA that are thought to be prone to error. Although popular, this approach has had mixed success, and several studies have even suggested that filtering might be detrimental to phylogenetic studies. We present a graph-based clustering method to address MSA uncertainty and error in the software Divvier (available at https://github.com/simonwhelan/Divvier), which uses a probabilistic model to identify clusters of characters that have strong statistical evidence of shared homology. These clusters can then be used either to filter characters from the MSA (partial filtering) or to represent each of the clusters in a new column (divvying). We validate Divvier through its performance on real and simulated benchmarks, finding that Divvier substantially outperforms existing filtering software by retaining more true pairwise homology calls and removing more false-positive pairwise homologies. We also find that Divvier, in contrast to other filtering tools, can alleviate long branch attraction artifacts induced by MSA and reduce the variation in tree estimates caused by MSA uncertainty.
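The divvying idea described in the abstract can be illustrated with a toy sketch. This is not Divvier's actual algorithm or code: the function name divvy_column, the pair_prob input, the 0.8 threshold, and the use of simple single-linkage clustering (connected components over thresholded pairs) are all assumptions made for illustration. The sketch takes one alignment column plus assumed pairwise homology probabilities and splits the column into one new column per cluster.

```python
from itertools import combinations


def divvy_column(column, pair_prob, threshold=0.8, gap="-"):
    """Split one alignment column into clusters of confidently homologous characters.

    column    : list of characters, one per sequence (gap symbol = no character)
    pair_prob : dict mapping frozenset({i, j}) to an assumed probability that
                the characters of sequences i and j in this column are homologous
    threshold : hypothetical cutoff above which a pair counts as confident
    Returns one new column (as a string) per cluster.
    """
    present = [i for i, ch in enumerate(column) if ch != gap]
    parent = {i: i for i in present}

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair whose assumed homology probability passes the cutoff.
    for i, j in combinations(present, 2):
        if pair_prob.get(frozenset((i, j)), 0.0) >= threshold:
            parent[find(i)] = find(j)

    # Collect the connected components (clusters) of sequence indices.
    clusters = {}
    for i in present:
        clusters.setdefault(find(i), []).append(i)

    # Emit one column per cluster, with gaps for sequences outside the cluster.
    new_columns = []
    for members in sorted(clusters.values()):
        new_col = [gap] * len(column)
        for i in members:
            new_col[i] = column[i]
        new_columns.append("".join(new_col))
    return new_columns


probs = {
    frozenset((0, 1)): 0.95,  # sequences 0 and 1: confident homology
    frozenset((2, 3)): 0.90,  # sequences 2 and 3: confident homology
    frozenset((1, 2)): 0.30,  # weak evidence linking the two groups
}
print(divvy_column(list("AAGG"), probs))
```

In this example the single column AAGG is split into two columns, AA-- and --GG, because no confident pair connects the first two sequences to the last two. A real tool would derive the pairwise probabilities from a probabilistic alignment model rather than take them as given.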
format Online
Article
Text
id pubmed-6933875
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-6933875 2019-12-30 Identifying Clusters of High Confidence Homologies in Multiple Sequence Alignments Ali, Raja Hashim Bogusz, Marcin Whelan, Simon Mol Biol Evol Methods Oxford University Press 2019-10 2019-06-18 /pmc/articles/PMC6933875/ /pubmed/31209473 http://dx.doi.org/10.1093/molbev/msz142 Text en © The Author(s) 2019.
Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. http://creativecommons.org/licenses/by-nc/4.0/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
title Identifying Clusters of High Confidence Homologies in Multiple Sequence Alignments
topic Methods