Testing statistical significance scores of sequence comparison methods with structure similarity

BACKGROUND: In recent years the Smith-Waterman sequence comparison algorithm has gained popularity due to improved implementations and rapidly increasing computing power. However, the quality and sensitivity of a database search is determined not only by the algorithm but also by the statistical s...

Bibliographic Details
Main authors: Hulsen, Tim; de Vlieg, Jacob; Leunissen, Jack AM; Groenen, Peter MA
Format: Text
Language: English
Published: BioMed Central 2006
Subjects:
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1618413/
https://www.ncbi.nlm.nih.gov/pubmed/17038163
http://dx.doi.org/10.1186/1471-2105-7-444
author Hulsen, Tim
de Vlieg, Jacob
Leunissen, Jack AM
Groenen, Peter MA
collection PubMed
description BACKGROUND: In recent years the Smith-Waterman sequence comparison algorithm has gained popularity due to improved implementations and rapidly increasing computing power. However, the quality and sensitivity of a database search is determined not only by the algorithm but also by the statistical significance testing for an alignment. The e-value is the most commonly used statistical validation method for sequence database searching. The CluSTr database and the Protein World database have been created using an alternative statistical significance test: a Z-score based on Monte-Carlo statistics. Several papers have described the superiority of the Z-score as compared to the e-value, using simulated data. We were interested in whether this could be validated when applied to existing, evolutionarily related protein sequences. RESULTS: All experiments are performed on the ASTRAL SCOP database. The Smith-Waterman sequence comparison algorithm with both e-value and Z-score statistics is evaluated, using ROC, CVE and AP measures. The BLAST and FASTA algorithms are used as a reference. We find that two out of three Smith-Waterman implementations with e-value are better at predicting structural similarities between proteins than the Smith-Waterman implementation with Z-score. SSEARCH in particular has very high scores. CONCLUSION: The compute-intensive Z-score does not have a clear advantage over the e-value. The Smith-Waterman implementations generally give better results than their heuristic counterparts. We recommend using the SSEARCH algorithm combined with e-values for pairwise sequence comparisons.
format Text
id pubmed-1618413
institution National Center for Biotechnology Information
language English
publishDate 2006
publisher BioMed Central
record_format MEDLINE/PubMed
spelling pubmed-1618413 2006-10-20 Testing statistical significance scores of sequence comparison methods with structure similarity Hulsen, Tim de Vlieg, Jacob Leunissen, Jack AM Groenen, Peter MA BMC Bioinformatics Research Article
BioMed Central 2006-10-12 /pmc/articles/PMC1618413/ /pubmed/17038163 http://dx.doi.org/10.1186/1471-2105-7-444 Text en Copyright © 2006 Hulsen et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Testing statistical significance scores of sequence comparison methods with structure similarity
topic Research Article
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1618413/
https://www.ncbi.nlm.nih.gov/pubmed/17038163
http://dx.doi.org/10.1186/1471-2105-7-444