Optimizing substitution matrix choice and gap parameters for sequence alignment
BACKGROUND: While substitution matrices can readily be computed from reference alignments, it is challenging to compute optimal or approximately optimal gap penalties. It is also not well understood which substitution matrices are the most effective when alignment accuracy is the goal rather than ho...
Main author: | Edgar, Robert C |
---|---|
Format: | Text |
Language: | English |
Published: | BioMed Central, 2009 |
Subjects: | Research article |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2791778/ https://www.ncbi.nlm.nih.gov/pubmed/19954534 http://dx.doi.org/10.1186/1471-2105-10-396 |
_version_ | 1782175202548908032 |
---|---|
author | Edgar, Robert C |
collection | PubMed |
description | BACKGROUND: While substitution matrices can readily be computed from reference alignments, it is challenging to compute optimal or approximately optimal gap penalties. It is also not well understood which substitution matrices are the most effective when alignment accuracy is the goal rather than homolog recognition. Here a new parameter optimization procedure, POP, is described and applied to the problems of optimizing gap penalties and selecting substitution matrices for pair-wise global protein alignments. RESULTS: POP is compared to a recent method due to Kim and Kececioglu and found to achieve from 0.2% to 1.3% higher accuracies on pair-wise benchmarks extracted from BALIBASE. The VTML matrix series is shown to be the most accurate on several global pair-wise alignment benchmarks, with VTML200 giving best or close to the best performance in all tests. BLOSUM matrices are found to be slightly inferior, even with the marginal improvements in the bug-fixed RBLOSUM series. The PAM series is significantly worse, giving accuracies typically 2% less than VTML. Integer rounding is found to cause slight degradations in accuracy. No evidence is found that selecting a matrix based on sequence divergence improves accuracy, suggesting that the use of this heuristic in CLUSTALW may be ineffective. Using VTML200 is found to improve the accuracy of CLUSTALW by 8% on BALIBASE and 5% on PREFAB. CONCLUSION: The hypothesis that more accurate alignments of distantly related sequences may be achieved using low-identity matrices is shown to be false for commonly used matrix types. Source code and test data are freely available from the author's web site at http://www.drive5.com/pop. |
format | Text |
id | pubmed-2791778 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2009 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-2791778 2009-12-11 Optimizing substitution matrix choice and gap parameters for sequence alignment Edgar, Robert C BMC Bioinformatics Research article BioMed Central 2009-12-02 /pmc/articles/PMC2791778/ /pubmed/19954534 http://dx.doi.org/10.1186/1471-2105-10-396 Text en Copyright ©2009 Edgar; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
title | Optimizing substitution matrix choice and gap parameters for sequence alignment |
topic | Research article |