Accurate statistical model of comparison between multiple sequence alignments

Comparison of multiple protein sequence alignments (MSA) reveals unexpected evolutionary relations between protein families and leads to exciting predictions of spatial structure and function. The power of MSA comparison critically depends on the quality of statistical model used to rank the similar...

Bibliographic Details
Main Authors:	Sadreyev, Ruslan I., Grishin, Nick V.
Format:	Text
Language:	English
Published:	Oxford University Press 2008
Subjects:	Computational Biology
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2367703/ https://www.ncbi.nlm.nih.gov/pubmed/18285364 http://dx.doi.org/10.1093/nar/gkn065

_version_	1782154353012899840
author	Sadreyev, Ruslan I. Grishin, Nick V.
author_facet	Sadreyev, Ruslan I. Grishin, Nick V.
author_sort	Sadreyev, Ruslan I.
collection	PubMed
description	Comparison of multiple protein sequence alignments (MSA) reveals unexpected evolutionary relations between protein families and leads to exciting predictions of spatial structure and function. The power of MSA comparison critically depends on the quality of statistical model used to rank the similarities found in a database search, so that biologically relevant relationships are discriminated from spurious connections. Here, we develop an accurate statistical description of MSA comparison that does not originate from conventional models of single sequence comparison and captures essential features of protein families. As a final result, we compute E-values for the similarity between any two MSA using a mathematical function that depends on MSA lengths and sequence diversity. To develop these estimates of statistical significance, we first establish a procedure for generating realistic alignment decoys that reproduce natural patterns of sequence conservation dictated by protein secondary structure. Second, since similarity scores between these alignments do not follow the classic Gumbel extreme value distribution, we propose a novel distribution that yields statistically perfect agreement with the data. Third, we apply this random model to database searches and show that it surpasses conventional models in the accuracy of detecting remote protein similarities.
format	Text
id	pubmed-2367703
institution	National Center for Biotechnology Information
language	English
publishDate	2008
publisher	Oxford University Press
record_format	MEDLINE/PubMed
spelling	pubmed-2367703 2008-05-07 Accurate statistical model of comparison between multiple sequence alignments Sadreyev, Ruslan I. Grishin, Nick V. Nucleic Acids Res Computational Biology
Oxford University Press 2008-04 2008-02-19 /pmc/articles/PMC2367703/ /pubmed/18285364 http://dx.doi.org/10.1093/nar/gkn065 Text en © 2008 The Author(s) http://creativecommons.org/licenses/by-nc/2.0/uk/ This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Computational Biology Sadreyev, Ruslan I. Grishin, Nick V. Accurate statistical model of comparison between multiple sequence alignments
title	Accurate statistical model of comparison between multiple sequence alignments
title_full	Accurate statistical model of comparison between multiple sequence alignments
title_fullStr	Accurate statistical model of comparison between multiple sequence alignments
title_full_unstemmed	Accurate statistical model of comparison between multiple sequence alignments
title_short	Accurate statistical model of comparison between multiple sequence alignments
title_sort	accurate statistical model of comparison between multiple sequence alignments
topic	Computational Biology
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2367703/ https://www.ncbi.nlm.nih.gov/pubmed/18285364 http://dx.doi.org/10.1093/nar/gkn065
work_keys_str_mv	AT sadreyevruslani accuratestatisticalmodelofcomparisonbetweenmultiplesequencealignments AT grishinnickv accuratestatisticalmodelofcomparisonbetweenmultiplesequencealignments
