Beyond the E-Value: Stratified Statistics for Protein Domain Prediction

E-values have been the dominant statistic for protein sequence analysis for the past two decades: from identifying statistically significant local sequence alignments to evaluating matches to hidden Markov models describing protein domain families. Here we formally show that for “stratified” multipl...

Bibliographic Details
Main Authors:	Ochoa, Alejandro; Storey, John D.; Llinás, Manuel; Singh, Mona
Format:	Online Article Text
Language:	English
Published:	Public Library of Science 2015
Subjects:	Research Article
Online Access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4648515/ https://www.ncbi.nlm.nih.gov/pubmed/26575353 http://dx.doi.org/10.1371/journal.pcbi.1004509

author	Ochoa, Alejandro; Storey, John D.; Llinás, Manuel; Singh, Mona
collection	PubMed
description	E-values have been the dominant statistic for protein sequence analysis for the past two decades: from identifying statistically significant local sequence alignments to evaluating matches to hidden Markov models describing protein domain families. Here we formally show that for “stratified” multiple hypothesis testing problems (that is, those in which statistical tests can be partitioned naturally), controlling the local False Discovery Rate (lFDR) per stratum, or partition, yields the most predictions across the data at any given threshold on the FDR or E-value over all strata combined. For the important problem of protein domain prediction, a key step in characterizing protein structure, function and evolution, we show that stratifying statistical tests by domain family yields excellent results. We develop the first FDR-estimating algorithms for domain prediction, and evaluate how well thresholds based on q-values, E-values and lFDRs perform in domain prediction using five complementary approaches for estimating empirical FDRs in this context. We show that stratified q-value thresholds substantially outperform E-values. Contradicting our theoretical results, q-values also outperform lFDRs; however, our tests reveal a small but coherent subset of domain families, biased towards models for specific repetitive patterns, for which weaknesses in random sequence models yield notably inaccurate statistical significance measures. For the remaining families, which have as-expected noise, lFDR thresholds outperform q-values, suggesting that further improvements in domain predictions can be achieved with improved modeling of random sequences. Overall, our theoretical and empirical findings suggest that the use of stratified q-values and lFDRs could result in improvements in a host of structured multiple hypothesis testing problems arising in bioinformatics, including genome-wide association studies, orthology prediction, and motif scanning.
format	Online Article Text
id	pubmed-4648515
institution	National Center for Biotechnology Information
language	English
publishDate	2015
publisher	Public Library of Science
record_format	MEDLINE/PubMed
spelling	pubmed-4648515 2015-11-25 PLoS Comput Biol Research Article Public Library of Science 2015-11-17 /pmc/articles/PMC4648515/ /pubmed/26575353 http://dx.doi.org/10.1371/journal.pcbi.1004509 Text en © 2015 Ochoa et al http://creativecommons.org/licenses/by/4.0/ This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are properly credited.
title	Beyond the E-Value: Stratified Statistics for Protein Domain Prediction
topic	Research Article
url	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4648515/ https://www.ncbi.nlm.nih.gov/pubmed/26575353 http://dx.doi.org/10.1371/journal.pcbi.1004509
