
Sequence Alignment, Mutual Information, and Dissimilarity Measures for Constructing Phylogenies

BACKGROUND: Existing sequence alignment algorithms use heuristic scoring schemes based on biological expertise, which cannot be used as objective distance metrics. As a result one relies on crude measures, like the p- or log-det distances, or makes explicit, and often too simplistic, a priori assump...


Bibliographic Details
Main Authors:	Penner, Orion; Grassberger, Peter; Paczuski, Maya
Format:	Text
Language:	English
Published:	Public Library of Science 2011
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3014950/ https://www.ncbi.nlm.nih.gov/pubmed/21245917 http://dx.doi.org/10.1371/journal.pone.0014373

_version_	1782195431236698112
author	Penner, Orion; Grassberger, Peter; Paczuski, Maya
author_sort	Penner, Orion
collection	PubMed
description	BACKGROUND: Existing sequence alignment algorithms use heuristic scoring schemes based on biological expertise, which cannot be used as objective distance metrics. As a result one relies on crude measures, like the p- or log-det distances, or makes explicit, and often too simplistic, a priori assumptions about sequence evolution. Information theory provides an alternative, in the form of mutual information (MI). MI is, in principle, an objective and model independent similarity measure, but it is not widely used in this context and no algorithm for extracting MI from a given alignment (without assuming an evolutionary model) is known. MI can be estimated without alignments, by concatenating and zipping sequences, but so far this has only produced estimates with uncontrolled errors, despite the fact that the normalized compression distance based on it has shown promising results. RESULTS: We describe a simple approach to get robust estimates of MI from global pairwise alignments. Our main result uses algorithmic (Kolmogorov) information theory, but we show that similar results can also be obtained from Shannon theory. For animal mitochondrial DNA our approach uses the alignments made by popular global alignment algorithms to produce MI estimates that are strikingly close to estimates obtained from the alignment free methods mentioned above. We point out that, due to the fact that it is not additive, normalized compression distance is not an optimal metric for phylogenetics but we propose a simple modification that overcomes the issue of additivity. We test several versions of our MI based distance measures on a large number of randomly chosen quartets and demonstrate that they all perform better than traditional measures like the Kimura or log-det (resp. paralinear) distances. CONCLUSIONS: Several versions of MI based distances outperform conventional distances in distance-based phylogeny. Even a simplified version based on single letter Shannon entropies, which can be easily incorporated in existing software packages, gave superior results throughout the entire animal kingdom. But we see the main virtue of our approach in a more general way. For example, it can also help to judge the relative merits of different alignment algorithms, by estimating the significance of specific alignments. It strongly suggests that information theory concepts can be exploited further in sequence analysis.
format	Text
id	pubmed-3014950
institution	National Center for Biotechnology Information
language	English
publishDate	2011
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-3014950 2011-01-18 Sequence Alignment, Mutual Information, and Dissimilarity Measures for Constructing Phylogenies Penner, Orion; Grassberger, Peter; Paczuski, Maya PLoS One Research Article Public Library of Science 2011-01-04 /pmc/articles/PMC3014950/ /pubmed/21245917 http://dx.doi.org/10.1371/journal.pone.0014373 Text en Penner et al. http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
spellingShingle	Research Article Penner, Orion Grassberger, Peter Paczuski, Maya Sequence Alignment, Mutual Information, and Dissimilarity Measures for Constructing Phylogenies
title	Sequence Alignment, Mutual Information, and Dissimilarity Measures for Constructing Phylogenies
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3014950/ https://www.ncbi.nlm.nih.gov/pubmed/21245917 http://dx.doi.org/10.1371/journal.pone.0014373
work_keys_str_mv	AT pennerorion sequencealignmentmutualinformationanddissimilaritymeasuresforconstructingphylogenies AT grassbergerpeter sequencealignmentmutualinformationanddissimilaritymeasuresforconstructingphylogenies AT paczuskimaya sequencealignmentmutualinformationanddissimilaritymeasuresforconstructingphylogenies
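
The abstract above mentions two ways of quantifying sequence similarity: MI based on single letter Shannon entropies of a global pairwise alignment, and the normalized compression distance obtained by concatenating and zipping sequences. The following is a minimal sketch of both ideas, not the authors' implementation. It assumes a simple plug-in (frequency count) estimate of MI over alignment columns, treats the gap character as an ordinary symbol, and uses zlib as a stand-in compressor; the function names `shannon_mi` and `ncd` are hypothetical.

```python
import math
import zlib
from collections import Counter


def shannon_mi(aligned_a: str, aligned_b: str) -> float:
    """Plug-in mutual information, in bits per alignment column.

    Assumption: both strings are rows of one global pairwise alignment,
    so they have equal length; a gap such as '-' counts as a symbol.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    n = len(aligned_a)
    if n == 0:
        return 0.0
    joint = Counter(zip(aligned_a, aligned_b))  # column pair counts
    count_a = Counter(aligned_a)                # marginal counts, row A
    count_b = Counter(aligned_b)                # marginal counts, row B
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log2( p(x,y) / (p(x) * p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_a[x] * count_b[y]))
    return mi


def ncd(a: bytes, b: bytes) -> float:
    """Normalized compression distance, with zlib as the compressor.

    Assumption: compressed length approximates algorithmic complexity.
    """
    len_a = len(zlib.compress(a, 9))
    len_b = len(zlib.compress(b, 9))
    len_ab = len(zlib.compress(a + b, 9))  # concatenate, then compress
    return (len_ab - min(len_a, len_b)) / max(len_a, len_b)


if __name__ == "__main__":
    print(shannon_mi("ACGTACGT", "ACGTACGT"))  # identical rows
    print(shannon_mi("AACC", "ACAC"))          # statistically independent rows
    print(ncd(b"ACGT" * 100, b"ACGT" * 100))
```

A plug-in estimate of this kind is biased for short alignments, and a general-purpose compressor only roughly approximates Kolmogorov complexity, so the values are best read as illustrative rather than as the distances evaluated in the article.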


Similar Items