Local sequence alignments statistics: deviations from Gumbel statistics in the rare-event tail
BACKGROUND: The optimal score for ungapped local alignments of infinitely long random sequences is known to follow a Gumbel extreme value distribution. Less is known about the important case where gaps are allowed. For this case, the distribution is only known empirically in the high-probability re...
Main Authors: | Wolfsheimer, Stefan; Burghardt, Bernd; Hartmann, Alexander K |
---|---|
Format: | Text |
Language: | English |
Published: | BioMed Central 2007 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1945026/ https://www.ncbi.nlm.nih.gov/pubmed/17625018 http://dx.doi.org/10.1186/1748-7188-2-9 |
_version_ | 1782134482064637952 |
---|---|
author | Wolfsheimer, Stefan Burghardt, Bernd Hartmann, Alexander K |
author_sort | Wolfsheimer, Stefan |
collection | PubMed |
description | BACKGROUND: The optimal score for ungapped local alignments of infinitely long random sequences is known to follow a Gumbel extreme value distribution. Less is known about the important case where gaps are allowed. For this case, the distribution is only known empirically in the high-probability region, which is biologically less relevant. RESULTS: We provide a method to obtain numerically the biologically relevant rare-event tail of the distribution. The method, which was outlined in an earlier work, is based on generating the sequences with a parametrized probability distribution that is biased with respect to the original biological one, in the framework of Metropolis Coupled Markov Chain Monte Carlo. Here, we first present the approach in detail and evaluate the convergence of the algorithm by considering a simple test case. In the earlier work, the method was applied to only a single example case. Therefore, we consider here a large set of parameters: we study the distributions for protein alignment with different substitution matrices (BLOSUM62 and PAM250) and affine gap costs with different parameter values. In the logarithmic phase (large gap costs) it was previously assumed that the Gumbel form still holds, hence the Gumbel distribution is usually used when evaluating p-values in databases. Here we show that for all cases, provided that the sequences are not too short (L > 400), a "modified" Gumbel distribution, i.e. a Gumbel distribution with an additional Gaussian factor, is suitable to describe the data. We also provide a "scaling analysis" of the parameters used in the modified Gumbel distribution. Furthermore, via a comparison with BLAST parameters, we show that significance estimations change considerably when using the true distributions as presented here. Finally, we also study the distribution of the sum statistics of the k best alignments. CONCLUSION: Our results show that the statistics of gapped and ungapped local alignments deviate significantly from Gumbel in the rare-event tail. We provide a Gaussian correction to the distribution and an analysis of its scaling behavior for several different scoring parameter sets, which are commonly used to search protein databases. The case of sum statistics of k best alignments is included. |
format | Text |
id | pubmed-1945026 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2007 |
publisher | BioMed Central |
record_format | MEDLINE/PubMed |
spelling | pubmed-1945026 2007-08-11 Algorithms Mol Biol Research BioMed Central 2007-07-11 /pmc/articles/PMC1945026/ /pubmed/17625018 http://dx.doi.org/10.1186/1748-7188-2-9 Text en Copyright © 2007 Wolfsheimer et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
title | Local sequence alignments statistics: deviations from Gumbel statistics in the rare-event tail |
topic | Research |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1945026/ https://www.ncbi.nlm.nih.gov/pubmed/17625018 http://dx.doi.org/10.1186/1748-7188-2-9 |
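
The abstract describes a "modified" Gumbel distribution, i.e. a Gumbel distribution with an additional Gaussian factor. The following is a minimal sketch of how such a correction changes a tail probability, not the authors' implementation. It assumes the Gaussian factor has the form exp(-lam2 * (s - s0)^2), which the record itself does not state. The function names and the parameter values (`lam`, `s0`, `lam2`) are hypothetical and chosen only for illustration; they are not fitted values from the paper or from BLAST.

```python
import math


def gumbel_pdf(s, lam, s0):
    """Gumbel density lam * exp(-z - exp(-z)) with z = lam * (s - s0)."""
    z = lam * (s - s0)
    return lam * math.exp(-z - math.exp(-z))


def modified_gumbel_unnormalized(s, lam, s0, lam2):
    """Gumbel density times an ASSUMED Gaussian factor exp(-lam2 * (s - s0)**2).

    The result is not normalized; tail_probability() normalizes numerically.
    """
    return gumbel_pdf(s, lam, s0) * math.exp(-lam2 * (s - s0) ** 2)


def integrate(f, a, b, n=20000):
    """Composite trapezoidal rule for f on [a, b] with n sub-intervals."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for i in range(1, n):
        total += f(a + i * h)
    return total * h


def tail_probability(density, threshold, lo=-50.0, hi=400.0):
    """Estimate P(S >= threshold) for a density supported on roughly [lo, hi]."""
    norm = integrate(density, lo, hi)
    return integrate(density, threshold, hi) / norm


# Hypothetical illustration parameters (not taken from the paper).
lam, s0, lam2 = 0.27, 30.0, 1e-4
threshold = 60.0

p_gumbel = tail_probability(lambda s: gumbel_pdf(s, lam, s0), threshold)
p_modified = tail_probability(
    lambda s: modified_gumbel_unnormalized(s, lam, s0, lam2), threshold
)

print(f"Gumbel tail probability:          {p_gumbel:.3e}")
print(f"Modified Gumbel tail probability: {p_modified:.3e}")
```

With a positive `lam2`, the Gaussian factor suppresses the density more strongly the further the score lies from `s0`, so the modified tail probability falls below the pure Gumbel value. This is the qualitative effect on significance estimates that the abstract refers to, shown here under the assumptions stated above.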