Cargando…

Efficient representation of uncertainty in multiple sequence alignments using directed acyclic graphs

BACKGROUND: A standard procedure in many areas of bioinformatics is to use a single multiple sequence alignment (MSA) as the basis for various types of analysis. However, downstream results may be highly sensitive to the alignment used, and neglecting the uncertainty in the alignment can lead to sig...

Descripción completa

Detalles Bibliográficos
Autores principales:	Herman, Joseph L, Novák, Ádám, Lyngsø, Rune, Szabó, Adrienn, Miklós, István, Hein, Jotun
Formato:	Online Artículo Texto
Lenguaje:	English
Publicado:	BioMed Central 2015
Materias:	Methodology Article
Acceso en línea:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4395974/ https://www.ncbi.nlm.nih.gov/pubmed/25888064 http://dx.doi.org/10.1186/s12859-015-0516-1

_version_	1782366524215918592
author	Herman, Joseph L Novák, Ádám Lyngsø, Rune Szabó, Adrienn Miklós, István Hein, Jotun
author_facet	Herman, Joseph L Novák, Ádám Lyngsø, Rune Szabó, Adrienn Miklós, István Hein, Jotun
author_sort	Herman, Joseph L
collection	PubMed
description	BACKGROUND: A standard procedure in many areas of bioinformatics is to use a single multiple sequence alignment (MSA) as the basis for various types of analysis. However, downstream results may be highly sensitive to the alignment used, and neglecting the uncertainty in the alignment can lead to significant bias in the resulting inference. In recent years, a number of approaches have been developed for probabilistic sampling of alignments, rather than simply generating a single optimum. However, this type of probabilistic information is currently not widely used in the context of downstream inference, since most existing algorithms are set up to make use of a single alignment. RESULTS: In this work we present a framework for representing a set of sampled alignments as a directed acyclic graph (DAG) whose nodes are alignment columns; each path through this DAG then represents a valid alignment. Since the probabilities of individual columns can be estimated from empirical frequencies, this approach enables sample-based estimation of posterior alignment probabilities. Moreover, due to conditional independencies between columns, the graph structure encodes a much larger set of alignments than the original set of sampled MSAs, such that the effective sample size is greatly increased. CONCLUSIONS: The alignment DAG provides a natural way to represent a distribution in the space of MSAs, and allows for existing algorithms to be efficiently scaled up to operate on large sets of alignments. As an example, we show how this can be used to compute marginal probabilities for tree topologies, averaging over a very large number of MSAs. This framework can also be used to generate a statistically meaningful summary alignment; example applications show that this summary alignment is consistently more accurate than the majority of the alignment samples, leading to improvements in downstream tree inference. Implementations of the methods described in this article are available at http://statalign.github.io/WeaveAlign. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-015-0516-1) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-4395974
institution	National Center for Biotechnology Information
language	English
publishDate	2015
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-43959742015-04-14 Efficient representation of uncertainty in multiple sequence alignments using directed acyclic graphs Herman, Joseph L Novák, Ádám Lyngsø, Rune Szabó, Adrienn Miklós, István Hein, Jotun BMC Bioinformatics Methodology Article BACKGROUND: A standard procedure in many areas of bioinformatics is to use a single multiple sequence alignment (MSA) as the basis for various types of analysis. However, downstream results may be highly sensitive to the alignment used, and neglecting the uncertainty in the alignment can lead to significant bias in the resulting inference. In recent years, a number of approaches have been developed for probabilistic sampling of alignments, rather than simply generating a single optimum. However, this type of probabilistic information is currently not widely used in the context of downstream inference, since most existing algorithms are set up to make use of a single alignment. RESULTS: In this work we present a framework for representing a set of sampled alignments as a directed acyclic graph (DAG) whose nodes are alignment columns; each path through this DAG then represents a valid alignment. Since the probabilities of individual columns can be estimated from empirical frequencies, this approach enables sample-based estimation of posterior alignment probabilities. Moreover, due to conditional independencies between columns, the graph structure encodes a much larger set of alignments than the original set of sampled MSAs, such that the effective sample size is greatly increased. CONCLUSIONS: The alignment DAG provides a natural way to represent a distribution in the space of MSAs, and allows for existing algorithms to be efficiently scaled up to operate on large sets of alignments. As an example, we show how this can be used to compute marginal probabilities for tree topologies, averaging over a very large number of MSAs. This framework can also be used to generate a statistically meaningful summary alignment; example applications show that this summary alignment is consistently more accurate than the majority of the alignment samples, leading to improvements in downstream tree inference. Implementations of the methods described in this article are available at http://statalign.github.io/WeaveAlign. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-015-0516-1) contains supplementary material, which is available to authorized users. BioMed Central 2015-04-01 /pmc/articles/PMC4395974/ /pubmed/25888064 http://dx.doi.org/10.1186/s12859-015-0516-1 Text en © Herman et al.; licensee BioMed Central. 2015 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle	Methodology Article Herman, Joseph L Novák, Ádám Lyngsø, Rune Szabó, Adrienn Miklós, István Hein, Jotun Efficient representation of uncertainty in multiple sequence alignments using directed acyclic graphs
title	Efficient representation of uncertainty in multiple sequence alignments using directed acyclic graphs
title_full	Efficient representation of uncertainty in multiple sequence alignments using directed acyclic graphs
title_fullStr	Efficient representation of uncertainty in multiple sequence alignments using directed acyclic graphs
title_full_unstemmed	Efficient representation of uncertainty in multiple sequence alignments using directed acyclic graphs
title_short	Efficient representation of uncertainty in multiple sequence alignments using directed acyclic graphs
title_sort	efficient representation of uncertainty in multiple sequence alignments using directed acyclic graphs
topic	Methodology Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4395974/ https://www.ncbi.nlm.nih.gov/pubmed/25888064 http://dx.doi.org/10.1186/s12859-015-0516-1
work_keys_str_mv	AT hermanjosephl efficientrepresentationofuncertaintyinmultiplesequencealignmentsusingdirectedacyclicgraphs AT novakadam efficientrepresentationofuncertaintyinmultiplesequencealignmentsusingdirectedacyclicgraphs AT lyngsørune efficientrepresentationofuncertaintyinmultiplesequencealignmentsusingdirectedacyclicgraphs AT szaboadrienn efficientrepresentationofuncertaintyinmultiplesequencealignmentsusingdirectedacyclicgraphs AT miklosistvan efficientrepresentationofuncertaintyinmultiplesequencealignmentsusingdirectedacyclicgraphs AT heinjotun efficientrepresentationofuncertaintyinmultiplesequencealignmentsusingdirectedacyclicgraphs

Efficient representation of uncertainty in multiple sequence alignments using directed acyclic graphs

Ejemplares similares