A modern maximum-likelihood theory for high-dimensional logistic regression
Main authors: | , |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | National Academy of Sciences 2019 |
Subjects: | |
Online access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6642380/ https://www.ncbi.nlm.nih.gov/pubmed/31262828 http://dx.doi.org/10.1073/pnas.1810420116 |
Summary: | Students in statistics or data science usually learn early on that when the sample size n is large relative to the number of variables p, fitting a logistic model by the method of maximum likelihood produces estimates that are consistent, and that well-known formulas quantify the variability of these estimates for the purpose of statistical inference. We are often told that these calculations are approximately valid if we have 5 to 10 observations per unknown parameter. This paper shows that this is far from the case and that, consequently, inferences produced by common software packages are often unreliable. Consider a logistic model with independent features in which n and p become increasingly large in a fixed ratio. We prove that (i) the maximum-likelihood estimate (MLE) is biased, (ii) the variability of the MLE is far greater than classically estimated, and (iii) the likelihood-ratio test (LRT) is not distributed as a χ². The bias of the MLE yields wrong predictions for the probability of a case based on observed values of the covariates. We present a theory that provides explicit expressions for the asymptotic bias and variance of the MLE and the asymptotic distribution of the LRT. We empirically demonstrate that these results are accurate in finite samples. Our results depend only on a single measure of signal strength, which leads to concrete proposals for obtaining accurate inference in finite samples through the estimate of this measure. |
---|
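The bias claim in the summary can be illustrated with a small simulation. The sketch below is not the authors' code: the sample size, the number of variables, the coefficient values, the feature scaling, and the helper names (`fit_logistic_mle`, `simulate`) are all illustrative assumptions, and the fit uses a plain Newton iteration with step halving rather than any particular software package. It draws independent Gaussian features with n/p = 5, fits the MLE, and reports the average ratio of estimated to true coefficient over the nonzero coefficients. If the MLE were unbiased, this ratio should be close to 1.

```python
import numpy as np


def sigmoid(t):
    # Numerically stable logistic function.
    return 0.5 * (1.0 + np.tanh(0.5 * t))


def neg_log_lik(X, y, b):
    # Negative logistic log-likelihood for labels y in {0, 1}.
    eta = X @ b
    return np.sum(np.logaddexp(0.0, eta) - y * eta)


def fit_logistic_mle(X, y, max_iter=100, tol=1e-6):
    # Newton's method with step halving (an assumed, generic solver).
    b = np.zeros(X.shape[1])
    for _ in range(max_iter):
        mu = sigmoid(X @ b)
        grad = X.T @ (mu - y)
        if np.linalg.norm(grad) < tol:
            break
        w = mu * (1.0 - mu)
        hess = X.T @ (X * w[:, None])
        step = np.linalg.solve(hess, grad)
        t, f0 = 1.0, neg_log_lik(X, y, b)
        # Halve the step until the objective does not increase.
        while neg_log_lik(X, y, b - t * step) > f0 and t > 1e-8:
            t *= 0.5
        b = b - t * step
    return b


def simulate(n=1000, p=200, seed=0):
    # Illustrative setup: n/p = 5, features i.i.d. N(0, 1/n),
    # one eighth of the coefficients at +10, one eighth at -10, rest zero.
    rng = np.random.default_rng(seed)
    X = rng.normal(scale=1.0 / np.sqrt(n), size=(n, p))
    beta = np.zeros(p)
    k = p // 8
    beta[:k] = 10.0
    beta[k:2 * k] = -10.0
    y = (rng.random(n) < sigmoid(X @ beta)).astype(float)
    beta_hat = fit_logistic_mle(X, y)
    nonnull = beta != 0
    # Average ratio of fitted to true coefficient on the nonzero entries.
    return float(np.mean(beta_hat[nonnull] / beta[nonnull]))


if __name__ == "__main__":
    print("mean(beta_hat / beta) over nonzero coefficients:", simulate())
```

With 5 observations per parameter, a ratio noticeably above 1 would be consistent with the summary's point that the MLE overstates effect sizes in this regime. Changing `n` and `p` while holding their ratio fixed is one way to check that the effect does not vanish as the sample grows.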