A modern maximum-likelihood theory for high-dimensional logistic regression

Bibliographic Details
Main Authors: Sur, Pragya, Candès, Emmanuel J.
Format: Online Article Text
Language: English
Published: National Academy of Sciences 2019
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6642380/
https://www.ncbi.nlm.nih.gov/pubmed/31262828
http://dx.doi.org/10.1073/pnas.1810420116
_version_ 1783436966643105792
author Sur, Pragya
Candès, Emmanuel J.
author_facet Sur, Pragya
Candès, Emmanuel J.
author_sort Sur, Pragya
collection PubMed
description Students in statistics or data science usually learn early on that when the sample size n is large relative to the number of variables p, fitting a logistic model by the method of maximum likelihood produces estimates that are consistent, and that there are well-known formulas quantifying the variability of these estimates for the purpose of statistical inference. We are often told that these calculations are approximately valid if we have 5 to 10 observations per unknown parameter. This paper shows that this is far from the case, and consequently, inferences produced by common software packages are often unreliable. Consider a logistic model with independent features in which n and p become increasingly large in a fixed ratio. We prove that (i) the maximum-likelihood estimate (MLE) is biased, (ii) the variability of the MLE is far greater than classically estimated, and (iii) the likelihood-ratio test (LRT) is not distributed as a χ². The bias of the MLE yields wrong predictions for the probability of a case based on observed values of the covariates. We present a theory that provides explicit expressions for the asymptotic bias and variance of the MLE and the asymptotic distribution of the LRT. We empirically demonstrate that these results are accurate in finite samples. Our results depend only on a single measure of signal strength, which leads to concrete proposals for obtaining accurate inference in finite samples through an estimate of this measure.
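The bias described in the abstract is easy to see in simulation. The sketch below (not the authors' code; the dimensions and signal strength are illustrative choices matching the regime κ = p/n = 0.2 with signal strength γ² = Var(xᵀβ) = 5 discussed in the paper) fits the unpenalized logistic MLE by Newton-Raphson and compares the fitted coefficients with the truth:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 2000, 400                       # kappa = p/n = 0.2
k = p // 2                             # number of nonzero coefficients
beta = np.zeros(p)
beta[:k] = np.sqrt(5.0 / k)            # signal strength gamma^2 = Var(x' beta) = 5

X = rng.standard_normal((n, p))
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta))).astype(float)

# Unpenalized logistic MLE via damped Newton-Raphson
b = np.zeros(p)
for _ in range(100):
    mu = 1.0 / (1.0 + np.exp(-X @ b))
    grad = X.T @ (y - mu)
    w = mu * (1.0 - mu)
    hess = X.T @ (X * w[:, None])
    step = np.linalg.solve(hess, grad)
    norm = np.linalg.norm(step)
    if norm > 1.0:                     # damp large steps for stability
        step /= norm
    b += step
    if norm < 1e-8:
        break

# Average multiplicative inflation of the MLE over the true signal support
inflation = np.mean(b[:k] / beta[:k])
print(f"mean beta_hat_j / beta_j on the support: {inflation:.2f}")
```

Classical large-sample theory would put this ratio near 1; in this high-dimensional regime the fitted coefficients come out systematically inflated well above 1 (the paper's theory gives an explicit inflation factor, roughly 1.5 for this κ and γ), illustrating claim (i) of the abstract.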
format Online
Article
Text
id pubmed-6642380
institution National Center for Biotechnology Information
language English
publishDate 2019
publisher National Academy of Sciences
record_format MEDLINE/PubMed
spelling pubmed-66423802019-07-25 A modern maximum-likelihood theory for high-dimensional logistic regression Sur, Pragya Candès, Emmanuel J. Proc Natl Acad Sci U S A PNAS Plus Students in statistics or data science usually learn early on that when the sample size n is large relative to the number of variables p, fitting a logistic model by the method of maximum likelihood produces estimates that are consistent, and that there are well-known formulas quantifying the variability of these estimates for the purpose of statistical inference. We are often told that these calculations are approximately valid if we have 5 to 10 observations per unknown parameter. This paper shows that this is far from the case, and consequently, inferences produced by common software packages are often unreliable. Consider a logistic model with independent features in which n and p become increasingly large in a fixed ratio. We prove that (i) the maximum-likelihood estimate (MLE) is biased, (ii) the variability of the MLE is far greater than classically estimated, and (iii) the likelihood-ratio test (LRT) is not distributed as a χ². The bias of the MLE yields wrong predictions for the probability of a case based on observed values of the covariates. We present a theory that provides explicit expressions for the asymptotic bias and variance of the MLE and the asymptotic distribution of the LRT. We empirically demonstrate that these results are accurate in finite samples. Our results depend only on a single measure of signal strength, which leads to concrete proposals for obtaining accurate inference in finite samples through an estimate of this measure. National Academy of Sciences 2019-07-16 2019-07-01 /pmc/articles/PMC6642380/ /pubmed/31262828 http://dx.doi.org/10.1073/pnas.1810420116 Text en Copyright © 2019 the Author(s). Published by PNAS.
https://creativecommons.org/licenses/by-nc-nd/4.0/ This open access article is distributed under the Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND).
spellingShingle PNAS Plus
Sur, Pragya
Candès, Emmanuel J.
A modern maximum-likelihood theory for high-dimensional logistic regression
title A modern maximum-likelihood theory for high-dimensional logistic regression
title_full A modern maximum-likelihood theory for high-dimensional logistic regression
title_fullStr A modern maximum-likelihood theory for high-dimensional logistic regression
title_full_unstemmed A modern maximum-likelihood theory for high-dimensional logistic regression
title_short A modern maximum-likelihood theory for high-dimensional logistic regression
title_sort modern maximum-likelihood theory for high-dimensional logistic regression
topic PNAS Plus
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6642380/
https://www.ncbi.nlm.nih.gov/pubmed/31262828
http://dx.doi.org/10.1073/pnas.1810420116
work_keys_str_mv AT surpragya amodernmaximumlikelihoodtheoryforhighdimensionallogisticregression
AT candesemmanuelj amodernmaximumlikelihoodtheoryforhighdimensionallogisticregression
AT surpragya modernmaximumlikelihoodtheoryforhighdimensionallogisticregression
AT candesemmanuelj modernmaximumlikelihoodtheoryforhighdimensionallogisticregression