
Estimation and efficient computation of the true probability of recurrence of short linear protein sequence motifs in unrelated proteins

BACKGROUND: Large datasets of protein interactions provide a rich resource for the discovery of Short Linear Motifs (SLiMs) that recur in unrelated proteins. However, existing methods for estimating the probability of motif recurrence may be biased by the size and composition of the search dataset,...


Bibliographic Details
Main authors:	Davey, Norman E, Edwards, Richard J, Shields, Denis C
Format:	Text
Language:	English
Published:	BioMed Central 2010
Subjects:	Research article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2819990/ https://www.ncbi.nlm.nih.gov/pubmed/20055997 http://dx.doi.org/10.1186/1471-2105-11-14

_version_	1782177330322472960
author	Davey, Norman E Edwards, Richard J Shields, Denis C
author_sort	Davey, Norman E
collection	PubMed
description	BACKGROUND: Large datasets of protein interactions provide a rich resource for the discovery of Short Linear Motifs (SLiMs) that recur in unrelated proteins. However, existing methods for estimating the probability of motif recurrence may be biased by the size and composition of the search dataset, such that p-value estimates from different datasets, or from motifs containing different numbers of non-wildcard positions, are not strictly comparable. Here, we develop more exact methods and explore the potential biases of computationally efficient approximations. RESULTS: A widely used heuristic for the calculation of motif over-representation approximates motif probability by assuming that all proteins have the same length and composition. We introduce p(v), which calculates the probability exactly. Secondly, the recently introduced SLiMFinder statistic, Sig, accounts for multiple testing (across all possible motifs) in motif discovery. However, it approximates the probability of all other possible motifs, occurring with a score of p or less, as being equal to p. Here, we show that the exhaustive calculation of the probability of all possible motif occurrences that are as rare as or rarer than the motif of interest, Sig', may be carried out efficiently by grouping motifs of a common probability (i.e. those which have permuted orders of the same residues). Sig'(v), which corrects both approximations, is shown to be uniformly distributed in a random dataset when searching for non-ambiguous motifs, indicating that it is a robust significance measure. CONCLUSIONS: A method is presented to compute exactly the true probability of a non-ambiguous short protein sequence motif, and the utility of an approximate approach for novel motif discovery across a large number of datasets is demonstrated.
format	Text
id	pubmed-2819990
institution	National Center for Biotechnology Information
language	English
publishDate	2010
publisher	BioMed Central
record_format	MEDLINE/PubMed
spelling	pubmed-28199902010-02-11 Estimation and efficient computation of the true probability of recurrence of short linear protein sequence motifs in unrelated proteins Davey, Norman E Edwards, Richard J Shields, Denis C BMC Bioinformatics Research article BACKGROUND: Large datasets of protein interactions provide a rich resource for the discovery of Short Linear Motifs (SLiMs) that recur in unrelated proteins. However, existing methods for estimating the probability of motif recurrence may be biased by the size and composition of the search dataset, such that p-value estimates from different datasets, or from motifs containing different numbers of non-wildcard positions, are not strictly comparable. Here, we develop more exact methods and explore the potential biases of computationally efficient approximations. RESULTS: A widely used heuristic for the calculation of motif over-representation approximates motif probability by assuming that all proteins have the same length and composition. We introduce p(v), which calculates the probability exactly. Secondly, the recently introduced SLiMFinder statistic Sig, accounts for multiple testing (across all possible motifs) in motif discovery. However, it approximates the probability of all other possible motifs, occurring with a score of p or less, as being equal to p. Here, we show that the exhaustive calculation of the probability of all possible motif occurrences that are as rare or rarer than the motif of interest, Sig', may be carried out efficiently by grouping motifs of a common probability (i.e. those which have permuted orders of the same residues). Sig'(v), which corrects both approximations, is shown to be uniformly distributed in a random dataset when searching for non-ambiguous motifs, indicating that it is a robust significance measure. 
CONCLUSIONS: A method is presented to compute exactly the true probability of a non-ambiguous short protein sequence motif, and the utility of an approximate approach for novel motif discovery across a large number of datasets is demonstrated. BioMed Central 2010-01-07 /pmc/articles/PMC2819990/ /pubmed/20055997 http://dx.doi.org/10.1186/1471-2105-11-14 Text en Copyright ©2010 Davey et al; licensee BioMed Central Ltd. http://creativecommons.org/licenses/by/2.0 This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
spellingShingle	Research article Davey, Norman E Edwards, Richard J Shields, Denis C Estimation and efficient computation of the true probability of recurrence of short linear protein sequence motifs in unrelated proteins
title	Estimation and efficient computation of the true probability of recurrence of short linear protein sequence motifs in unrelated proteins
topic	Research article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2819990/ https://www.ncbi.nlm.nih.gov/pubmed/20055997 http://dx.doi.org/10.1186/1471-2105-11-14
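
The description field in this record contrasts a heuristic that treats all proteins as having the same length and composition with a per-protein calculation, and it mentions grouping motifs that are permuted orders of the same residues. The Python sketch below illustrates those two ideas only. It is not the authors' p(v), Sig or Sig' implementation: it assumes independent residues and independent start positions, and every function name in it is hypothetical.

```python
from collections import Counter
from itertools import combinations_with_replacement
from math import comb, factorial, prod
from typing import Dict, Iterator, List, Tuple


def motif_prob(motif: str, freqs: Dict[str, float]) -> float:
    """Probability of the motif at one fixed position; '.' is a wildcard.

    Assumption: residues are drawn independently with frequencies `freqs`.
    """
    return prod(1.0 if c == "." else freqs.get(c, 0.0) for c in motif)


def occurrence_prob(p_site: float, length: int, motif_len: int) -> float:
    """P(at least one occurrence in a protein), start positions treated as independent."""
    positions = max(length - motif_len + 1, 0)
    return 1.0 - (1.0 - p_site) ** positions


def support_tail(per_protein: List[float], k: int) -> float:
    """P(at least k proteins contain the motif) when each protein has its own probability."""
    dist = [1.0]  # dist[j] = P(exactly j of the proteins processed so far contain the motif)
    for p in per_protein:
        nxt = [0.0] * (len(dist) + 1)
        for j, mass in enumerate(dist):
            nxt[j] += mass * (1.0 - p)
            nxt[j + 1] += mass * p
        dist = nxt
    return sum(dist[k:])


def binomial_tail(p: float, n: int, k: int) -> float:
    """Same tail under the heuristic that every protein shares one probability p."""
    return sum(comb(n, j) * p**j * (1.0 - p) ** (n - j) for j in range(k, n + 1))


def grouped_motifs(
    freqs: Dict[str, float], motif_len: int
) -> Iterator[Tuple[Tuple[str, ...], float, int]]:
    """Yield (residue multiset, shared probability, number of distinct orderings).

    Under the independence assumption, all orderings of the same residues have
    the same probability, so one calculation covers the whole group.
    """
    for combo in combinations_with_replacement(sorted(freqs), motif_len):
        counts = Counter(combo)
        orderings = factorial(motif_len) // prod(factorial(c) for c in counts.values())
        yield combo, prod(freqs[r] for r in combo), orderings


if __name__ == "__main__":
    # Toy sequences, invented for illustration only.
    proteins = ["MPPLPAKSPPLP", "MKKAAGDE", "PLPPLPSSPAPLPKKPLP"]
    motif = "P.P"
    per_protein = []
    for seq in proteins:
        freqs = {r: n / len(seq) for r, n in Counter(seq).items()}
        per_protein.append(occurrence_prob(motif_prob(motif, freqs), len(seq), len(motif)))
    print("per-protein tail:", support_tail(per_protein, 2))

    pooled = "".join(proteins)
    pooled_freqs = {r: n / len(pooled) for r, n in Counter(pooled).items()}
    mean_len = round(len(pooled) / len(proteins))
    p_equal = occurrence_prob(motif_prob(motif, pooled_freqs), mean_len, len(motif))
    print("equal-protein tail:", binomial_tail(p_equal, len(proteins), 2))
```

Comparing the two printed tails on a dataset with uneven lengths or compositions gives a feel for how far the equal-protein heuristic can drift. The grouping function visits one entry per residue multiset rather than one per ordered motif, which is the kind of saving the description attributes to grouping motifs of a common probability.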
