Accurate statistical model of comparison between multiple sequence alignments
Comparison of multiple protein sequence alignments (MSA) reveals unexpected evolutionary relations between protein families and leads to exciting predictions of spatial structure and function. The power of MSA comparison critically depends on the quality of the statistical model used to rank the similar...
Main Authors: | Sadreyev, Ruslan I., Grishin, Nick V. |
---|---|
Format: | Text |
Language: | English |
Published: | Oxford University Press 2008 |
Subjects: | Computational Biology |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2367703/ https://www.ncbi.nlm.nih.gov/pubmed/18285364 http://dx.doi.org/10.1093/nar/gkn065 |
_version_ | 1782154353012899840 |
---|---|
author | Sadreyev, Ruslan I. Grishin, Nick V. |
author_facet | Sadreyev, Ruslan I. Grishin, Nick V. |
author_sort | Sadreyev, Ruslan I. |
collection | PubMed |
description | Comparison of multiple protein sequence alignments (MSA) reveals unexpected evolutionary relations between protein families and leads to exciting predictions of spatial structure and function. The power of MSA comparison critically depends on the quality of the statistical model used to rank the similarities found in a database search, so that biologically relevant relationships are discriminated from spurious connections. Here, we develop an accurate statistical description of MSA comparison that does not originate from conventional models of single sequence comparison and captures essential features of protein families. As a final result, we compute E-values for the similarity between any two MSAs using a mathematical function that depends on MSA lengths and sequence diversity. To develop these estimates of statistical significance, we first establish a procedure for generating realistic alignment decoys that reproduce natural patterns of sequence conservation dictated by protein secondary structure. Second, since similarity scores between these alignments do not follow the classic Gumbel extreme value distribution, we propose a novel distribution that yields statistically perfect agreement with the data. Third, we apply this random model to database searches and show that it surpasses conventional models in the accuracy of detecting remote protein similarities. |
format | Text |
id | pubmed-2367703 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2008 |
publisher | Oxford University Press |
record_format | MEDLINE/PubMed |
spelling | pubmed-2367703 2008-05-07 Accurate statistical model of comparison between multiple sequence alignments Sadreyev, Ruslan I. Grishin, Nick V. Nucleic Acids Res Computational Biology Oxford University Press 2008-04 2008-02-19 /pmc/articles/PMC2367703/ /pubmed/18285364 http://dx.doi.org/10.1093/nar/gkn065 Text en © 2008 The Author(s) http://creativecommons.org/licenses/by-nc/2.0/uk/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited. |
spellingShingle | Computational Biology Sadreyev, Ruslan I. Grishin, Nick V. Accurate statistical model of comparison between multiple sequence alignments |
title | Accurate statistical model of comparison between multiple sequence alignments |
title_sort | accurate statistical model of comparison between multiple sequence alignments |
topic | Computational Biology |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2367703/ https://www.ncbi.nlm.nih.gov/pubmed/18285364 http://dx.doi.org/10.1093/nar/gkn065 |
work_keys_str_mv | AT sadreyevruslani accuratestatisticalmodelofcomparisonbetweenmultiplesequencealignments AT grishinnickv accuratestatisticalmodelofcomparisonbetweenmultiplesequencealignments |
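
The description contrasts the authors' model with the classic Gumbel extreme value distribution. As a point of reference only, the sketch below shows how an E-value is conventionally derived from a Gumbel tail probability. It is not the novel distribution proposed in the article, whose form is not given in this record. The function names, the location `mu`, the scale `beta`, and the number of comparisons are all hypothetical choices made for illustration.

```python
import math


def gumbel_p_value(score: float, mu: float, beta: float) -> float:
    """Tail probability P(S >= score) under a Gumbel extreme value distribution.

    mu (location) and beta (scale) are hypothetical parameters; in practice
    they would be fitted to scores from comparisons of unrelated alignments.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    z = (score - mu) / beta
    if z < -30.0:
        # exp(-z) is astronomically large here, so the tail probability is 1
        # to machine precision; returning early also avoids overflow.
        return 1.0
    # 1 - exp(-exp(-z)), written with expm1 to stay accurate for tiny p-values.
    return -math.expm1(-math.exp(-z))


def e_value(score: float, mu: float, beta: float, n_comparisons: int) -> float:
    """Expected number of chance hits scoring >= score in n_comparisons."""
    return n_comparisons * gumbel_p_value(score, mu, beta)


if __name__ == "__main__":
    # Illustrative values only, not taken from the article.
    print(e_value(score=45.0, mu=20.0, beta=4.0, n_comparisons=10_000))
```

Under this conventional model the E-value falls roughly exponentially with the score once the score is well above `mu`. The description states that MSA comparison scores deviate from this behavior, which is the motivation for the authors' alternative distribution.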