
Bayesian interpolation with deep linear networks

Bibliographic Details

Main Authors: Hanin, Boris; Zlokapa, Alexander
Format: Online Article Text
Language: English
Published: National Academy of Sciences 2023
Subjects: Physical Sciences
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10266010/
https://www.ncbi.nlm.nih.gov/pubmed/37252994
http://dx.doi.org/10.1073/pnas.2301345120
author Hanin, Boris
Zlokapa, Alexander
author_facet Hanin, Boris
Zlokapa, Alexander
author_sort Hanin, Boris
collection PubMed
description Characterizing how neural network depth, width, and dataset size jointly impact model quality is a central problem in deep learning theory. We give here a complete solution in the special case of linear networks with output dimension one trained using zero noise Bayesian inference with Gaussian weight priors and mean squared error as a negative log-likelihood. For any training dataset, network depth, and hidden layer widths, we find non-asymptotic expressions for the predictive posterior and Bayesian model evidence in terms of Meijer-G functions, a class of meromorphic special functions of a single complex variable. Through novel asymptotic expansions of these Meijer-G functions, a rich new picture of the joint role of depth, width, and dataset size emerges. We show that linear networks make provably optimal predictions at infinite depth: the posterior of infinitely deep linear networks with data-agnostic priors is the same as that of shallow networks with evidence-maximizing data-dependent priors. This yields a principled reason to prefer deeper networks when priors are forced to be data-agnostic. Moreover, we show that with data-agnostic priors, Bayesian model evidence in wide linear networks is maximized at infinite depth, elucidating the salutary role of increased depth for model selection. Underpinning our results is a novel emergent notion of effective depth, given by the number of hidden layers times the number of data points divided by the network width; this determines the structure of the posterior in the large-data limit.
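To make the setup described in the abstract concrete, the following is a minimal sketch (not taken from the paper) of a depth-L linear network with scalar output and Gaussian weight priors, together with the emergent effective depth ratio: the number of hidden layers times the number of data points divided by the network width. The layer widths, prior variances, and data sizes below are illustrative assumptions.

import numpy as np

# Minimal illustrative sketch (assumed values, not taken from the paper):
# a deep *linear* network with scalar output, Gaussian weight priors,
# and the emergent "effective depth" ratio described in the abstract.

rng = np.random.default_rng(0)

n_in = 10       # input dimension (assumed)
width = 100     # hidden layer width, taken uniform across layers (assumed)
n_hidden = 20   # number of hidden layers (assumed)
n_data = 50     # number of training points (assumed)

def sample_network():
    # One draw from a Gaussian prior over all weight matrices.
    layers = [rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(width, n_in))]
    for _ in range(n_hidden - 1):
        layers.append(rng.normal(0.0, 1.0 / np.sqrt(width), size=(width, width)))
    layers.append(rng.normal(0.0, 1.0 / np.sqrt(width), size=(1, width)))  # output dimension one
    return layers

def forward(layers, x):
    # No nonlinearities: the network computes a product of matrices applied to x.
    h = x
    for W in layers:
        h = W @ h
    return h  # shape (1,)

x = rng.normal(size=n_in)
y = forward(sample_network(), x)

# Effective depth from the abstract: hidden layers * data points / width.
effective_depth = n_hidden * n_data / width
print(f"prior sample output: {y[0]:.3f}, effective depth: {effective_depth:.2f}")

Sampling many such networks and averaging their outputs would approximate the prior predictive; the paper characterizes the corresponding zero-noise posterior and model evidence exactly in terms of Meijer-G functions.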
format Online
Article
Text
id pubmed-10266010
institution National Center for Biotechnology Information
language English
publishDate 2023
publisher National Academy of Sciences
record_format MEDLINE/PubMed
spelling pubmed-10266010 2023-11-30 Bayesian interpolation with deep linear networks Hanin, Boris Zlokapa, Alexander Proc Natl Acad Sci U S A Physical Sciences Characterizing how neural network depth, width, and dataset size jointly impact model quality is a central problem in deep learning theory. We give here a complete solution in the special case of linear networks with output dimension one trained using zero noise Bayesian inference with Gaussian weight priors and mean squared error as a negative log-likelihood. For any training dataset, network depth, and hidden layer widths, we find non-asymptotic expressions for the predictive posterior and Bayesian model evidence in terms of Meijer-G functions, a class of meromorphic special functions of a single complex variable. Through novel asymptotic expansions of these Meijer-G functions, a rich new picture of the joint role of depth, width, and dataset size emerges. We show that linear networks make provably optimal predictions at infinite depth: the posterior of infinitely deep linear networks with data-agnostic priors is the same as that of shallow networks with evidence-maximizing data-dependent priors. This yields a principled reason to prefer deeper networks when priors are forced to be data-agnostic. Moreover, we show that with data-agnostic priors, Bayesian model evidence in wide linear networks is maximized at infinite depth, elucidating the salutary role of increased depth for model selection. Underpinning our results is a novel emergent notion of effective depth, given by the number of hidden layers times the number of data points divided by the network width; this determines the structure of the posterior in the large-data limit. National Academy of Sciences 2023-05-30 2023-06-06 /pmc/articles/PMC10266010/ /pubmed/37252994 http://dx.doi.org/10.1073/pnas.2301345120 Text en Copyright © 2023 the Author(s). Published by PNAS. This article is distributed under the Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND) (https://creativecommons.org/licenses/by-nc-nd/4.0/).
spellingShingle Physical Sciences
Hanin, Boris
Zlokapa, Alexander
Bayesian interpolation with deep linear networks
title Bayesian interpolation with deep linear networks
title_full Bayesian interpolation with deep linear networks
title_fullStr Bayesian interpolation with deep linear networks
title_full_unstemmed Bayesian interpolation with deep linear networks
title_short Bayesian interpolation with deep linear networks
title_sort bayesian interpolation with deep linear networks
topic Physical Sciences
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10266010/
https://www.ncbi.nlm.nih.gov/pubmed/37252994
http://dx.doi.org/10.1073/pnas.2301345120
work_keys_str_mv AT haninboris bayesianinterpolationwithdeeplinearnetworks
AT zlokapaalexander bayesianinterpolationwithdeeplinearnetworks