A modern maximum-likelihood theory for high-dimensional logistic regression
Main Authors: | Sur, Pragya, Candès, Emmanuel J. |
---|---|
Format: | Online Article Text |
Language: | English |
Published: | National Academy of Sciences 2019 |
Subjects: | |
Online Access: | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6642380/ https://www.ncbi.nlm.nih.gov/pubmed/31262828 http://dx.doi.org/10.1073/pnas.1810420116 |
_version_ | 1783436966643105792 |
---|---|
author | Sur, Pragya Candès, Emmanuel J. |
author_facet | Sur, Pragya Candès, Emmanuel J. |
author_sort | Sur, Pragya |
collection | PubMed |
description | Students in statistics or data science usually learn early on that when the sample size n is large relative to the number of variables p, fitting a logistic model by the method of maximum likelihood produces estimates that are consistent and that there are well-known formulas that quantify the variability of these estimates which are used for the purpose of statistical inference. We are often told that these calculations are approximately valid if we have 5 to 10 observations per unknown parameter. This paper shows that this is far from the case, and consequently, inferences produced by common software packages are often unreliable. Consider a logistic model with independent features in which n and p become increasingly large in a fixed ratio. We prove that (i) the maximum-likelihood estimate (MLE) is biased, (ii) the variability of the MLE is far greater than classically estimated, and (iii) the likelihood-ratio test (LRT) is not distributed as a χ². The bias of the MLE yields wrong predictions for the probability of a case based on observed values of the covariates. We present a theory, which provides explicit expressions for the asymptotic bias and variance of the MLE and the asymptotic distribution of the LRT. We empirically demonstrate that these results are accurate in finite samples. Our results depend only on a single measure of signal strength, which leads to concrete proposals for obtaining accurate inference in finite samples through the estimate of this measure. |
format | Online Article Text |
id | pubmed-6642380 |
institution | National Center for Biotechnology Information |
language | English |
publishDate | 2019 |
publisher | National Academy of Sciences |
record_format | MEDLINE/PubMed |
spelling | pubmed-6642380 2019-07-25 A modern maximum-likelihood theory for high-dimensional logistic regression Sur, Pragya Candès, Emmanuel J. Proc Natl Acad Sci U S A PNAS Plus Students in statistics or data science usually learn early on that when the sample size n is large relative to the number of variables p, fitting a logistic model by the method of maximum likelihood produces estimates that are consistent and that there are well-known formulas that quantify the variability of these estimates which are used for the purpose of statistical inference. We are often told that these calculations are approximately valid if we have 5 to 10 observations per unknown parameter. This paper shows that this is far from the case, and consequently, inferences produced by common software packages are often unreliable. Consider a logistic model with independent features in which n and p become increasingly large in a fixed ratio. We prove that (i) the maximum-likelihood estimate (MLE) is biased, (ii) the variability of the MLE is far greater than classically estimated, and (iii) the likelihood-ratio test (LRT) is not distributed as a χ². The bias of the MLE yields wrong predictions for the probability of a case based on observed values of the covariates. We present a theory, which provides explicit expressions for the asymptotic bias and variance of the MLE and the asymptotic distribution of the LRT. We empirically demonstrate that these results are accurate in finite samples. Our results depend only on a single measure of signal strength, which leads to concrete proposals for obtaining accurate inference in finite samples through the estimate of this measure. National Academy of Sciences 2019-07-16 2019-07-01 /pmc/articles/PMC6642380/ /pubmed/31262828 http://dx.doi.org/10.1073/pnas.1810420116 Text en Copyright © 2019 the Author(s). Published by PNAS. This open access article is distributed under the Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND) (https://creativecommons.org/licenses/by-nc-nd/4.0/). |
spellingShingle | PNAS Plus Sur, Pragya Candès, Emmanuel J. A modern maximum-likelihood theory for high-dimensional logistic regression |
title | A modern maximum-likelihood theory for high-dimensional logistic regression |
title_full | A modern maximum-likelihood theory for high-dimensional logistic regression |
title_fullStr | A modern maximum-likelihood theory for high-dimensional logistic regression |
title_full_unstemmed | A modern maximum-likelihood theory for high-dimensional logistic regression |
title_short | A modern maximum-likelihood theory for high-dimensional logistic regression |
title_sort | modern maximum-likelihood theory for high-dimensional logistic regression |
topic | PNAS Plus |
url | https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6642380/ https://www.ncbi.nlm.nih.gov/pubmed/31262828 http://dx.doi.org/10.1073/pnas.1810420116 |
work_keys_str_mv | AT surpragya amodernmaximumlikelihoodtheoryforhighdimensionallogisticregression AT candesemmanuelj amodernmaximumlikelihoodtheoryforhighdimensionallogisticregression AT surpragya modernmaximumlikelihoodtheoryforhighdimensionallogisticregression AT candesemmanuelj modernmaximumlikelihoodtheoryforhighdimensionallogisticregression |
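The description field above states that, when p and n grow in a fixed ratio, the logistic MLE is biased and more variable than classical theory predicts. The following minimal simulation sketch, which is not code from the paper, illustrates that claim empirically; the sample size, dimension, signal strength, and feature distribution below are assumptions chosen only for demonstration.

```python
# Minimal simulation sketch (not code from the paper): it checks the reported
# bias of the logistic MLE when p/n is held fixed instead of shrinking to 0.
# All concrete choices (n, p, signal strength, feature distribution) are
# assumptions made purely for illustration.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n, p = 2000, 400                        # fixed ratio p/n = 0.2, both "large"
gamma = 2.0                             # assumed signal strength: sd of x'beta
beta = np.full(p, gamma / np.sqrt(p))   # equal-magnitude true coefficients

X = rng.standard_normal((n, p))         # independent standard normal features
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta)))

mle = sm.Logit(y, X).fit(disp=0)        # plain (unpenalized) maximum likelihood
slope = float(np.mean(mle.params / beta))
print(f"average ratio of MLE to true coefficients: {slope:.2f}")  # ~1 classically
```

Under classical large-sample theory the printed ratio should be close to 1; in this fixed-ratio regime it typically comes out noticeably larger, consistent with the coefficient inflation the paper quantifies. Correcting the bias, variance, and LRT calibration in practice requires estimating the single signal-strength measure mentioned in the description, a step this sketch does not attempt.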