
Accurate statistical model of comparison between multiple sequence alignments

Comparison of multiple protein sequence alignments (MSA) reveals unexpected evolutionary relations between protein families and leads to exciting predictions of spatial structure and function. The power of MSA comparison critically depends on the quality of the statistical model used to rank the similar...

Bibliographic Details
Main authors: Sadreyev, Ruslan I., Grishin, Nick V.
Format: Text
Language: English
Published: Oxford University Press 2008
Subjects: Computational Biology
Online access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2367703/
https://www.ncbi.nlm.nih.gov/pubmed/18285364
http://dx.doi.org/10.1093/nar/gkn065
_version_ 1782154353012899840
author Sadreyev, Ruslan I.
Grishin, Nick V.
author_facet Sadreyev, Ruslan I.
Grishin, Nick V.
author_sort Sadreyev, Ruslan I.
collection PubMed
description Comparison of multiple protein sequence alignments (MSA) reveals unexpected evolutionary relations between protein families and leads to exciting predictions of spatial structure and function. The power of MSA comparison critically depends on the quality of the statistical model used to rank the similarities found in a database search, so that biologically relevant relationships are discriminated from spurious connections. Here, we develop an accurate statistical description of MSA comparison that does not originate from conventional models of single-sequence comparison and that captures essential features of protein families. As a final result, we compute E-values for the similarity between any two MSAs using a mathematical function that depends on MSA lengths and sequence diversity. To develop these estimates of statistical significance, we first establish a procedure for generating realistic alignment decoys that reproduce natural patterns of sequence conservation dictated by protein secondary structure. Second, since similarity scores between these alignments do not follow the classic Gumbel extreme value distribution, we propose a novel distribution that yields statistically perfect agreement with the data. Third, we apply this random model to database searches and show that it surpasses conventional models in the accuracy of detecting remote protein similarities.
format Text
id pubmed-2367703
institution National Center for Biotechnology Information
language English
publishDate 2008
publisher Oxford University Press
record_format MEDLINE/PubMed
spelling pubmed-2367703 2008-05-07 Accurate statistical model of comparison between multiple sequence alignments Sadreyev, Ruslan I. Grishin, Nick V. Nucleic Acids Res Computational Biology
Oxford University Press 2008-04 2008-02-19 /pmc/articles/PMC2367703/ /pubmed/18285364 http://dx.doi.org/10.1093/nar/gkn065 Text en © 2008 The Author(s) http://creativecommons.org/licenses/by-nc/2.0/uk/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
title Accurate statistical model of comparison between multiple sequence alignments
topic Computational Biology