Testing statistical significance scores of sequence comparison methods with structure similarity
BACKGROUND: In recent years the Smith-Waterman sequence comparison algorithm has gained popularity due to improved implementations and rapidly increasing computing power. However, the quality and sensitivity of a database search are determined not only by the algorithm but also by the statistical s...
Main authors: | Hulsen, Tim, de Vlieg, Jacob, Leunissen, Jack AM, Groenen, Peter MA |
---|---|
Format: | Text |
Language: | English |
Published: | BioMed Central 2006 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1618413/ https://www.ncbi.nlm.nih.gov/pubmed/17038163 http://dx.doi.org/10.1186/1471-2105-7-444 |
_version_ | 1782130521338281984 |
---|---|
author | Hulsen, Tim de Vlieg, Jacob Leunissen, Jack AM Groenen, Peter MA |
author_facet | Hulsen, Tim de Vlieg, Jacob Leunissen, Jack AM Groenen, Peter MA |
author_sort | Hulsen, Tim |
collection | PubMed |
description | BACKGROUND: In recent years the Smith-Waterman sequence comparison algorithm has gained popularity due to improved implementations and rapidly increasing computing power. However, the quality and sensitivity of a database search are determined not only by the algorithm but also by the statistical significance testing for an alignment. The e-value is the most commonly used statistical validation method for sequence database searching. The CluSTr database and the Protein World database have been created using an alternative statistical significance test: a Z-score based on Monte-Carlo statistics. Several papers have described the superiority of the Z-score as compared to the e-value, using simulated data. We were interested in whether this could be validated when applied to existing, evolutionarily related protein sequences. RESULTS: All experiments are performed on the ASTRAL SCOP database. The Smith-Waterman sequence comparison algorithm with both e-value and Z-score statistics is evaluated, using ROC, CVE and AP measures. The BLAST and FASTA algorithms are used as reference. We find that two out of three Smith-Waterman implementations with e-value are better at predicting structural similarities between proteins than the Smith-Waterman implementation with Z-score. SSEARCH especially has very high scores. CONCLUSION: The compute-intensive Z-score does not have a clear advantage over the e-value. The Smith-Waterman implementations give generally better results than their heuristic counterparts. We recommend using the SSEARCH algorithm combined with e-values for pairwise sequence comparisons. |
format | Text |
id | pubmed-1618413 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2006 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-1618413 2006-10-20 Testing statistical significance scores of sequence comparison methods with structure similarity Hulsen, Tim de Vlieg, Jacob Leunissen, Jack AM Groenen, Peter MA BMC Bioinformatics Research Article BACKGROUND: In recent years the Smith-Waterman sequence comparison algorithm has gained popularity due to improved implementations and rapidly increasing computing power. However, the quality and sensitivity of a database search are determined not only by the algorithm but also by the statistical significance testing for an alignment. The e-value is the most commonly used statistical validation method for sequence database searching. The CluSTr database and the Protein World database have been created using an alternative statistical significance test: a Z-score based on Monte-Carlo statistics. Several papers have described the superiority of the Z-score as compared to the e-value, using simulated data. We were interested in whether this could be validated when applied to existing, evolutionarily related protein sequences. RESULTS: All experiments are performed on the ASTRAL SCOP database. The Smith-Waterman sequence comparison algorithm with both e-value and Z-score statistics is evaluated, using ROC, CVE and AP measures. The BLAST and FASTA algorithms are used as reference. We find that two out of three Smith-Waterman implementations with e-value are better at predicting structural similarities between proteins than the Smith-Waterman implementation with Z-score. SSEARCH especially has very high scores. CONCLUSION: The compute-intensive Z-score does not have a clear advantage over the e-value. The Smith-Waterman implementations give generally better results than their heuristic counterparts. We recommend using the SSEARCH algorithm combined with e-values for pairwise sequence comparisons.
BioMed Central 2006-10-12 /pmc/articles/PMC1618413/ /pubmed/17038163 http://dx.doi.org/10.1186/1471-2105-7-444 Text en Copyright © 2006 Hulsen et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Research Article Hulsen, Tim de Vlieg, Jacob Leunissen, Jack AM Groenen, Peter MA Testing statistical significance scores of sequence comparison methods with structure similarity |
title | Testing statistical significance scores of sequence comparison methods with structure similarity |
title_full | Testing statistical significance scores of sequence comparison methods with structure similarity |
title_fullStr | Testing statistical significance scores of sequence comparison methods with structure similarity |
title_full_unstemmed | Testing statistical significance scores of sequence comparison methods with structure similarity |
title_short | Testing statistical significance scores of sequence comparison methods with structure similarity |
title_sort | testing statistical significance scores of sequence comparison methods with structure similarity |
topic | Research Article |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1618413/ https://www.ncbi.nlm.nih.gov/pubmed/17038163 http://dx.doi.org/10.1186/1471-2105-7-444 |
work_keys_str_mv | AT hulsentim testingstatisticalsignificancescoresofsequencecomparisonmethodswithstructuresimilarity AT devliegjacob testingstatisticalsignificancescoresofsequencecomparisonmethodswithstructuresimilarity AT leunissenjackam testingstatisticalsignificancescoresofsequencecomparisonmethodswithstructuresimilarity AT groenenpeterma testingstatisticalsignificancescoresofsequencecomparisonmethodswithstructuresimilarity |
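
The description above contrasts the e-value with "a Z-score based on Monte-Carlo statistics". As a rough illustration of that idea only, the sketch below scores two sequences with a toy Smith-Waterman routine, rescores the first sequence against shuffled copies of the second, and reports how many standard deviations the real score lies above the shuffled mean. Everything in it is an assumption made for illustration, not the procedure of the article, CluSTr, or Protein World: the function names, the match/mismatch/linear-gap scoring, the choice to shuffle only the second sequence, and the default of 200 shuffles are all hypothetical.

```python
import random
import statistics


def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score under a toy linear-gap scoring scheme."""
    prev = [0] * (len(b) + 1)  # previous row of the dynamic-programming matrix
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # Local alignment: a cell never drops below zero.
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def monte_carlo_z_score(a, b, n_shuffles=200, seed=0):
    """Z = (observed - mean of shuffled scores) / std of shuffled scores."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = smith_waterman_score(a, b)
    letters = list(b)
    null_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # keeps composition, destroys residue order
        null_scores.append(smith_waterman_score(a, "".join(letters)))
    mean = statistics.mean(null_scores)
    stdev = statistics.pstdev(null_scores)
    if stdev == 0:
        return 0.0  # degenerate null distribution; Z is undefined, 0.0 is a placeholder
    return (observed - mean) / stdev
```

Each Z-score here costs `n_shuffles` extra alignments per sequence pair, which is consistent with the description calling the Z-score compute-intensive.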