
Improving pairwise sequence alignment accuracy using near-optimal protein sequence alignments

BACKGROUND: Pairwise alignments produced by sequence similarity searches are a powerful tool for identifying homologous proteins (proteins that share a common ancestor and a similar structure), but they often fail to represent accurately the structural alignments inferr...


Bibliographic Details
Main Authors: Sierk, Michael L, Smoot, Michael E, Bass, Ellen J, Pearson, William R
Format: Text
Language: English
Published: BioMed Central 2010
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2850363/
https://www.ncbi.nlm.nih.gov/pubmed/20307279
http://dx.doi.org/10.1186/1471-2105-11-146
author Sierk, Michael L
Smoot, Michael E
Bass, Ellen J
Pearson, William R
collection PubMed
description BACKGROUND: Pairwise alignments produced by sequence similarity searches are a powerful tool for identifying homologous proteins (proteins that share a common ancestor and a similar structure), but they often fail to represent accurately the structural alignments inferred from three-dimensional coordinates. Since sequence alignment algorithms produce optimal alignments, the best structural alignments must reflect suboptimal sequence alignment scores. Thus, we have examined a range of suboptimal sequence alignments and a range of scoring parameters to understand better which sequence alignments are likely to be more structurally accurate.
RESULTS: We compared near-optimal protein sequence alignments produced by the Zuker algorithm and a set of probabilistic alignments produced by the probA program with structural alignments produced by four different structure alignment algorithms. There is significant overlap between the solution spaces of structural alignments and both the near-optimal sequence alignments produced by commonly used scoring parameters for sequences that share significant sequence similarity (E-values < 10^-5) and the ensemble of probA alignments. We constructed a logistic regression model incorporating three input variables derived from sets of near-optimal alignments: robustness, edge frequency, and maximum bits-per-position. A ROC analysis shows that this model classifies amino acid pairs (edges in the alignment path graph) according to their likelihood of appearing in structural alignments more accurately than the robustness score alone. We investigated various trimming protocols for removing incorrect edges from the optimal sequence alignment; the most effective protocol is to remove matches from the semi-global optimal alignment that are outside the boundaries of the local alignment, although trimming according to the model-generated probabilities achieves a similar level of improvement. The model can also be used to generate novel alignments by using the probabilities in lieu of a scoring matrix. These alignments are typically better than the optimal sequence alignment, and include novel correct structural edges. We find that the probA alignments sample a larger variety of alignments than the Zuker set, which more frequently results in alignments that are closer to the structural alignments, but that using the probA alignments as input to the regression model does not increase performance.
CONCLUSIONS: The pool of suboptimal pairwise protein sequence alignments substantially overlaps structure-based alignments for pairs with statistically significant similarity, and a regression model based on information contained in this alignment pool improves the accuracy of pairwise alignments with respect to structure-based alignments.
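The edge-classification and trimming idea in the abstract can be illustrated with a minimal Python sketch. Everything specific here is an assumption for illustration: the `Edge` class, the function names, the coefficient values, and the 0.5 threshold are hypothetical and are not taken from the article. The sketch reads "edge frequency" as the fraction of alignments in the near-optimal pool that contain a given residue pair, and treats robustness and maximum bits-per-position as precomputed inputs.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """One aligned residue pair (i in sequence A, j in sequence B)."""
    i: int
    j: int
    robustness: float             # assumed precomputed
    edge_frequency: float         # fraction of pooled alignments containing (i, j)
    max_bits_per_position: float  # assumed precomputed


# Hypothetical coefficients; a real model would be fitted against
# structural alignments, which this sketch does not do.
INTERCEPT = -4.0
W_ROBUSTNESS = 0.05
W_EDGE_FREQUENCY = 3.0
W_MAX_BITS = 1.0


def edge_frequency(pair, alignment_pool):
    """Fraction of alignments (each a set of (i, j) pairs) containing `pair`."""
    if not alignment_pool:
        return 0.0
    return sum(pair in aln for aln in alignment_pool) / len(alignment_pool)


def edge_probability(edge):
    """Logistic model: estimated probability that the edge is structurally correct."""
    z = (INTERCEPT
         + W_ROBUSTNESS * edge.robustness
         + W_EDGE_FREQUENCY * edge.edge_frequency
         + W_MAX_BITS * edge.max_bits_per_position)
    return 1.0 / (1.0 + math.exp(-z))


def trim_alignment(edges, threshold=0.5):
    """Keep only the edges whose modelled probability reaches the threshold."""
    return [e for e in edges if edge_probability(e) >= threshold]


# Toy pool of three near-optimal alignments, each a set of residue pairs.
pool = [{(10, 12), (11, 13)}, {(10, 12)}, {(10, 12), (11, 14)}]

optimal = [
    Edge(10, 12, robustness=40.0,
         edge_frequency=edge_frequency((10, 12), pool),
         max_bits_per_position=1.2),
    Edge(11, 13, robustness=2.0,
         edge_frequency=edge_frequency((11, 13), pool),
         max_bits_per_position=0.2),
]

trimmed = trim_alignment(optimal)
```

With these made-up numbers, the pair present in every pooled alignment is kept and the weakly supported pair is trimmed. The same probabilities could in principle replace a scoring matrix to generate new alignments, as the abstract describes, but that step is not sketched here.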
format Text
id pubmed-2850363
institution National Center for Biotechnology Information
language English
publishDate 2010
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2850363 2010-04-07 Improving pairwise sequence alignment accuracy using near-optimal protein sequence alignments Sierk, Michael L Smoot, Michael E Bass, Ellen J Pearson, William R BMC Bioinformatics Methodology article BioMed Central 2010-03-22 /pmc/articles/PMC2850363/ /pubmed/20307279 http://dx.doi.org/10.1186/1471-2105-11-146 Text en Copyright ©2010 Sierk et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Improving pairwise sequence alignment accuracy using near-optimal protein sequence alignments
topic Methodology article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2850363/
https://www.ncbi.nlm.nih.gov/pubmed/20307279
http://dx.doi.org/10.1186/1471-2105-11-146