OASIS: An interpretable, finite-sample valid alternative to Pearson’s χ² for scientific discovery

Bibliographic Details
Main authors:	Baharav, Tavor Z., Tse, David, Salzman, Julia
Format:	Online Article Text
Language:	English
Published:	Cold Spring Harbor Laboratory 2023
Subjects:	Article
Online access:	https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10634974/ https://www.ncbi.nlm.nih.gov/pubmed/37961606 http://dx.doi.org/10.1101/2023.03.16.533008

Description
Summary:	Contingency tables, data represented as count matrices, are ubiquitous across quantitative research and data-science applications. Existing statistical tests are insufficient, however, as none is simultaneously computationally efficient and statistically valid for a finite number of observations. In this work, motivated by a recent application in reference-free genomic inference (1), we develop OASIS (Optimized Adaptive Statistic for Inferring Structure), a family of statistical tests for contingency tables. OASIS constructs a test statistic that is linear in the normalized data matrix, providing closed-form p-value bounds through classical concentration inequalities. In the process, OASIS provides a decomposition of the table, lending interpretability to its rejection of the null. We derive the asymptotic distribution of the OASIS test statistic, showing that these finite-sample bounds correctly characterize the test statistic’s p-value up to a variance term. Experiments on genomic sequencing data highlight the power and interpretability of OASIS. The same method based on OASIS significance calls detects SARS-CoV-2 and Mycobacterium tuberculosis strains de novo, which cannot be achieved with current approaches. We demonstrate in simulations that OASIS is robust to overdispersion, a common feature of genomic data such as single-cell RNA sequencing: under accepted noise models, OASIS still provides good control of the false discovery rate, while Pearson’s χ² test consistently rejects the null. Additionally, we show on synthetic data that OASIS is more powerful than Pearson’s χ² test in certain regimes, including for some important two-group alternatives, which we corroborate with approximate power calculations.
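
For orientation, the baseline that the summary compares against, Pearson’s χ² statistic on a contingency table, can be sketched as follows. This is not OASIS and is not taken from the paper: it is the textbook statistic only, the function name `pearson_chi2` and the example table are hypothetical, and the sketch assumes every row and column total is positive. Converting the statistic to a p-value relies on an asymptotic χ² approximation, which is the finite-sample limitation the summary refers to.

```python
def pearson_chi2(table):
    """Return (statistic, degrees_of_freedom) for a table of counts.

    `table` is a list of equal-length rows of non-negative counts.
    Assumes every row total and column total is positive.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof


# Hypothetical 2 x 2 table of counts (illustrative data only).
example_table = [[10, 20], [20, 10]]
stat, dof = pearson_chi2(example_table)
print(stat, dof)
```

Here every expected count is 15, so the statistic is 4 × (5² / 15) = 20/3 with one degree of freedom.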
