A modern maximum-likelihood theory for high-dimensional logistic regression
Main Authors: | Sur, Pragya, Candès, Emmanuel J. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | National Academy of Sciences 2019 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6642380/ https://www.ncbi.nlm.nih.gov/pubmed/31262828 http://dx.doi.org/10.1073/pnas.1810420116 |
_version_ | 1783436966643105792 |
---|---|
author | Sur, Pragya Candès, Emmanuel J. |
author_facet | Sur, Pragya Candès, Emmanuel J. |
author_sort | Sur, Pragya |
collection | PubMed |
description | Students in statistics or data science usually learn early on that when the sample size n is large relative to the number of variables p, fitting a logistic model by the method of maximum likelihood produces estimates that are consistent and that there are well-known formulas that quantify the variability of these estimates which are used for the purpose of statistical inference. We are often told that these calculations are approximately valid if we have 5 to 10 observations per unknown parameter. This paper shows that this is far from the case, and consequently, inferences produced by common software packages are often unreliable. Consider a logistic model with independent features in which n and p become increasingly large in a fixed ratio. We prove that (i) the maximum-likelihood estimate (MLE) is biased, (ii) the variability of the MLE is far greater than classically estimated, and (iii) the likelihood-ratio test (LRT) is not distributed as a χ². The bias of the MLE yields wrong predictions for the probability of a case based on observed values of the covariates. We present a theory, which provides explicit expressions for the asymptotic bias and variance of the MLE and the asymptotic distribution of the LRT. We empirically demonstrate that these results are accurate in finite samples. Our results depend only on a single measure of signal strength, which leads to concrete proposals for obtaining accurate inference in finite samples through the estimate of this measure. |
format | Online Article Text |
id | pubmed-6642380 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2019 |
publisher | National Academy of Sciences |
record_format | MEDLINE/PubMed |
spelling | pubmed-6642380 2019-07-25 A modern maximum-likelihood theory for high-dimensional logistic regression Sur, Pragya Candès, Emmanuel J. Proc Natl Acad Sci U S A PNAS Plus Students in statistics or data science usually learn early on that when the sample size n is large relative to the number of variables p, fitting a logistic model by the method of maximum likelihood produces estimates that are consistent and that there are well-known formulas that quantify the variability of these estimates which are used for the purpose of statistical inference. We are often told that these calculations are approximately valid if we have 5 to 10 observations per unknown parameter. This paper shows that this is far from the case, and consequently, inferences produced by common software packages are often unreliable. Consider a logistic model with independent features in which n and p become increasingly large in a fixed ratio. We prove that (i) the maximum-likelihood estimate (MLE) is biased, (ii) the variability of the MLE is far greater than classically estimated, and (iii) the likelihood-ratio test (LRT) is not distributed as a χ². The bias of the MLE yields wrong predictions for the probability of a case based on observed values of the covariates. We present a theory, which provides explicit expressions for the asymptotic bias and variance of the MLE and the asymptotic distribution of the LRT. We empirically demonstrate that these results are accurate in finite samples. Our results depend only on a single measure of signal strength, which leads to concrete proposals for obtaining accurate inference in finite samples through the estimate of this measure. National Academy of Sciences 2019-07-16 2019-07-01 /pmc/articles/PMC6642380/ /pubmed/31262828 http://dx.doi.org/10.1073/pnas.1810420116 Text en Copyright © 2019 the Author(s). Published by PNAS. This open access article is distributed under the Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND) (https://creativecommons.org/licenses/by-nc-nd/4.0/). |
spellingShingle | PNAS Plus Sur, Pragya Candès, Emmanuel J. A modern maximum-likelihood theory for high-dimensional logistic regression |
title | A modern maximum-likelihood theory for high-dimensional logistic regression |
title_full | A modern maximum-likelihood theory for high-dimensional logistic regression |
title_fullStr | A modern maximum-likelihood theory for high-dimensional logistic regression |
title_full_unstemmed | A modern maximum-likelihood theory for high-dimensional logistic regression |
title_short | A modern maximum-likelihood theory for high-dimensional logistic regression |
title_sort | modern maximum-likelihood theory for high-dimensional logistic regression |
topic | PNAS Plus |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6642380/ https://www.ncbi.nlm.nih.gov/pubmed/31262828 http://dx.doi.org/10.1073/pnas.1810420116 |
work_keys_str_mv | AT surpragya amodernmaximumlikelihoodtheoryforhighdimensionallogisticregression AT candesemmanuelj amodernmaximumlikelihoodtheoryforhighdimensionallogisticregression AT surpragya modernmaximumlikelihoodtheoryforhighdimensionallogisticregression AT candesemmanuelj modernmaximumlikelihoodtheoryforhighdimensionallogisticregression |
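The description field above states that, when p and n grow in a fixed ratio, the logistic MLE is biased and more variable than classical theory predicts. The following minimal simulation sketch, which is not code from the paper, illustrates that claim empirically; the sample size, dimension, signal strength, and feature distribution below are assumptions chosen only for demonstration.

```python
# Minimal simulation sketch (not code from the paper): it checks the reported
# bias of the logistic MLE when p/n is held fixed instead of shrinking to 0.
# All concrete choices (n, p, signal strength, feature distribution) are
# assumptions made purely for illustration.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, p = 2000, 400                        # fixed ratio p/n = 0.2, both "large"
gamma = 2.0                             # assumed signal strength: sd of x'beta
beta = np.full(p, gamma / np.sqrt(p))   # equal-magnitude true coefficients

X = rng.standard_normal((n, p))         # independent standard normal features
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta)))

mle = sm.Logit(y, X).fit(disp=0)        # plain (unpenalized) maximum likelihood
slope = float(np.mean(mle.params / beta))
print(f"average ratio of MLE to true coefficients: {slope:.2f}")  # ~1 classically
```

Under classical large-sample theory the printed ratio should be close to 1; in this fixed-ratio regime it typically comes out noticeably larger, consistent with the coefficient inflation the paper quantifies. Correcting the bias, variance, and LRT calibration in practice requires estimating the single signal-strength measure mentioned in the description, a step this sketch does not attempt.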