Characterization of multiple sequence alignment errors using complete-likelihood score and position-shift map

BACKGROUND: Reconstruction of multiple sequence alignments (MSAs) is a crucial step in most homology-based sequence analyses, which constitute an integral part of computational biology. To improve the accuracy of this crucial step, it is essential to better characterize errors that state-of-the-art...

Bibliographic Details
Main Author:	Ezawa, Kiyoshi
Format:	Online Article Text
Language:	English
Published:	BioMed Central 2016
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4799563/ https://www.ncbi.nlm.nih.gov/pubmed/26992851 http://dx.doi.org/10.1186/s12859-016-0945-5

_version_	1782422372494606336
author	Ezawa, Kiyoshi
author_facet	Ezawa, Kiyoshi
author_sort	Ezawa, Kiyoshi
collection	PubMed
description	BACKGROUND: Reconstruction of multiple sequence alignments (MSAs) is a crucial step in most homology-based sequence analyses, which constitute an integral part of computational biology. To improve the accuracy of this crucial step, it is essential to better characterize errors that state-of-the-art aligners typically make. For this purpose, we here introduce two tools: the complete-likelihood score and the position-shift map. RESULTS: The logarithm of the total probability of a MSA under a stochastic model of sequence evolution along a time axis via substitutions, insertions and deletions (called the “complete-likelihood score” here) can serve as an ideal score of the MSA. A position-shift map, which maps the difference in each residue’s position between two MSAs onto one of them, can clearly visualize where and how MSA errors occurred and help disentangle composite errors. To characterize MSA errors using these tools, we constructed three sets of simulated MSAs of selectively neutral mammalian DNA sequences, with small, moderate and large divergences, under a stochastic evolutionary model with an empirically common power-law insertion/deletion length distribution. Then, we reconstructed MSAs using MAFFT and Prank as representative state-of-the-art single-optimum-search aligners. About 40–99% of the hundreds of thousands of gapped segments were involved in alignment errors. In a substantial fraction, from about 1/4 to over 3/4, of erroneously reconstructed segments, reconstructed MSAs by each aligner showed complete-likelihood scores not lower than those of the true MSAs. Out of the remaining errors, a majority by an iterative option of MAFFT showed discrepancies between the aligner-specific score and the complete-likelihood score, and a majority by Prank seemed due to inadequate exploration of the MSA space. Analyses by position-shift maps indicated that true MSAs are in considerable neighborhoods of reconstructed MSAs in about 80–99% of the erroneous segments for small and moderate divergences, but in only a minority for large divergences. CONCLUSIONS: The results of this study suggest that measures to further improve the accuracy of reconstructed MSAs would substantially differ depending on the types of aligners. They also re-emphasize the importance of obtaining a probability distribution of fairly likely MSAs, instead of just searching for a single optimum MSA. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1186/s12859-016-0945-5) contains supplementary material, which is available to authorized users.
format	Online Article Text
id	pubmed-4799563
institution	National Center for Biotechnology Information
language	English
publishDate	2016
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-4799563 2016-03-20 Characterization of multiple sequence alignment errors using complete-likelihood score and position-shift map Ezawa, Kiyoshi BMC Bioinformatics Research Article BioMed Central 2016-03-18 /pmc/articles/PMC4799563/ /pubmed/26992851 http://dx.doi.org/10.1186/s12859-016-0945-5 Text en © Ezawa. 2016 Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
spellingShingle	Research Article Ezawa, Kiyoshi Characterization of multiple sequence alignment errors using complete-likelihood score and position-shift map
title	Characterization of multiple sequence alignment errors using complete-likelihood score and position-shift map
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4799563/ https://www.ncbi.nlm.nih.gov/pubmed/26992851 http://dx.doi.org/10.1186/s12859-016-0945-5