Optimizing substitution matrix choice and gap parameters for sequence alignment


Bibliographic Details
Main author: Edgar, Robert C
Format: Text
Language: English
Published: BioMed Central 2009
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2791778/
https://www.ncbi.nlm.nih.gov/pubmed/19954534
http://dx.doi.org/10.1186/1471-2105-10-396
author Edgar, Robert C
collection PubMed
description BACKGROUND: While substitution matrices can readily be computed from reference alignments, it is challenging to compute optimal or approximately optimal gap penalties. It is also not well understood which substitution matrices are the most effective when alignment accuracy is the goal rather than homolog recognition. Here a new parameter optimization procedure, POP, is described and applied to the problems of optimizing gap penalties and selecting substitution matrices for pair-wise global protein alignments. RESULTS: POP is compared to a recent method due to Kim and Kececioglu and found to achieve from 0.2% to 1.3% higher accuracies on pair-wise benchmarks extracted from BALIBASE. The VTML matrix series is shown to be the most accurate on several global pair-wise alignment benchmarks, with VTML200 giving best or close to the best performance in all tests. BLOSUM matrices are found to be slightly inferior, even with the marginal improvements in the bug-fixed RBLOSUM series. The PAM series is significantly worse, giving accuracies typically 2% less than VTML. Integer rounding is found to cause slight degradations in accuracy. No evidence is found that selecting a matrix based on sequence divergence improves accuracy, suggesting that the use of this heuristic in CLUSTALW may be ineffective. Using VTML200 is found to improve the accuracy of CLUSTALW by 8% on BALIBASE and 5% on PREFAB. CONCLUSION: The hypothesis that more accurate alignments of distantly related sequences may be achieved using low-identity matrices is shown to be false for commonly used matrix types. Source code and test data are freely available from the author's web site at http://www.drive5.com/pop.
format Text
id pubmed-2791778
institution National Center for Biotechnology Information
language English
publishDate 2009
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-2791778 2009-12-11 Optimizing substitution matrix choice and gap parameters for sequence alignment Edgar, Robert C BMC Bioinformatics Research article
BioMed Central 2009-12-02 /pmc/articles/PMC2791778/ /pubmed/19954534 http://dx.doi.org/10.1186/1471-2105-10-396 Text en Copyright ©2009 Edgar; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Optimizing substitution matrix choice and gap parameters for sequence alignment
topic Research article
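
Illustrative note on the terminology used in the description: the sketch below shows, under stated assumptions, how a substitution score and affine gap penalties combine into the score of a pair-wise global alignment, which is the kind of objective whose parameters the description reports optimizing. It is not POP and it is not the author's code. The function names `global_affine_score` and `toy_score`, the match/mismatch values (a stand-in for a real matrix such as VTML200), the penalty values, and the convention that a gap of length k costs `gap_open + (k - 1) * gap_extend` are all assumptions made for illustration.

```python
import math


def toy_score(x, y):
    """Hypothetical stand-in for a substitution matrix lookup."""
    return 2 if x == y else -1


def global_affine_score(a, b, score, gap_open, gap_extend):
    """Best global alignment score of a and b with affine gap penalties.

    Three-state dynamic programme (Gotoh-style):
      M[i][j]: best score with a[i-1] aligned to b[j-1]
      X[i][j]: best score with a[i-1] aligned to a gap
      Y[i][j]: best score with b[j-1] aligned to a gap
    Assumed convention: a gap of length k costs gap_open + (k - 1) * gap_extend.
    """
    neg = -math.inf
    n, m = len(a), len(b)
    M = [[neg] * (m + 1) for _ in range(n + 1)]
    X = [[neg] * (m + 1) for _ in range(n + 1)]
    Y = [[neg] * (m + 1) for _ in range(n + 1)]

    M[0][0] = 0
    # Leading gaps: one open plus extensions along each border.
    for i in range(1, n + 1):
        X[i][0] = -(gap_open + (i - 1) * gap_extend)
    for j in range(1, m + 1):
        Y[0][j] = -(gap_open + (j - 1) * gap_extend)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Align two residues, coming from any state.
            M[i][j] = score(a[i - 1], b[j - 1]) + max(
                M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1]
            )
            # Gap opposite a[i-1]: extend an existing gap or open a new one.
            X[i][j] = max(
                M[i - 1][j] - gap_open,
                X[i - 1][j] - gap_extend,
                Y[i - 1][j] - gap_open,
            )
            # Gap opposite b[j-1]: extend an existing gap or open a new one.
            Y[i][j] = max(
                M[i][j - 1] - gap_open,
                Y[i][j - 1] - gap_extend,
                X[i][j - 1] - gap_open,
            )

    return max(M[n][m], X[n][m], Y[n][m])


if __name__ == "__main__":
    # A-A and T-T match (+4); the two-residue gap costs 3 + 1 = 4.
    print(global_affine_score("ACGT", "AT", toy_score, 3, 1))  # prints 0
```

Changing `gap_open`, `gap_extend`, or the scoring function changes which alignment is optimal; choosing those values so that the optimal alignments agree with reference alignments is the parameter-selection problem the description refers to.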