Shaping the learning landscape in neural networks around wide flat minima


Bibliographic Details
Main Authors: Baldassi, Carlo, Pittorino, Fabrizio, Zecchina, Riccardo
Format: Online Article Text
Language: English
Published: National Academy of Sciences 2020
Subjects:
Online Access: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6955380/
https://www.ncbi.nlm.nih.gov/pubmed/31871189
http://dx.doi.org/10.1073/pnas.1908636117
_version_ 1783486925571620864
author Baldassi, Carlo
Pittorino, Fabrizio
Zecchina, Riccardo
author_facet Baldassi, Carlo
Pittorino, Fabrizio
Zecchina, Riccardo
author_sort Baldassi, Carlo
collection PubMed
description Learning in deep neural networks takes place by minimizing a nonconvex high-dimensional loss function, typically by a stochastic gradient descent (SGD) strategy. The learning process is observed to be able to find good minimizers without getting stuck in local critical points and such minimizers are often satisfactory at avoiding overfitting. How these 2 features can be kept under control in nonlinear devices composed of millions of tunable connections is a profound and far-reaching open question. In this paper we study basic nonconvex 1- and 2-layer neural network models that learn random patterns and derive a number of basic geometrical and algorithmic features which suggest some answers. We first show that the error loss function presents few extremely wide flat minima (WFM) which coexist with narrower minima and critical points. We then show that the minimizers of the cross-entropy loss function overlap with the WFM of the error loss. We also show examples of learning devices for which WFM do not exist. From the algorithmic perspective we derive entropy-driven greedy and message-passing algorithms that focus their search on wide flat regions of minimizers. In the case of SGD and cross-entropy loss, we show that a slow reduction of the norm of the weights along the learning process also leads to WFM. We corroborate the results by a numerical study of the correlations between the volumes of the minimizers, their Hessian, and their generalization performance on real data.
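The abstract above mentions, among other results, training with SGD on the cross-entropy loss while slowly reducing the norm of the weights, and characterizing how wide and flat the resulting minima are. The snippet below is a minimal illustrative sketch of those two ideas for a 1-layer perceptron learning random ±1 patterns; it is not the authors' code, the paper's entropy-driven greedy and message-passing algorithms are not implemented, and all function names, hyperparameters, and the perturbation-based flatness proxy are assumptions made for illustration.

```python
# Sketch (not the authors' code): SGD on cross-entropy with slow weight-norm
# reduction for a 1-layer perceptron on random patterns, plus a crude flatness
# probe that measures training error under random weight perturbations.
import numpy as np

rng = np.random.default_rng(0)
N, P = 200, 120                              # weights, random patterns (alpha = P/N = 0.6)
X = rng.choice([-1.0, 1.0], size=(P, N))     # random +/-1 input patterns
y = rng.choice([-1.0, 1.0], size=P)          # random +/-1 labels

def train_error(w):
    """Fraction of misclassified patterns for the perceptron sign(X @ w)."""
    return float(np.mean(np.sign(X @ w) != y))

def sgd_cross_entropy(epochs=300, lr=0.05, norm_decay=0.999):
    """SGD on the logistic (cross-entropy) loss; after each epoch the weight
    norm is rescaled toward a slowly shrinking target norm."""
    w = rng.normal(size=N) / np.sqrt(N)
    target_norm = np.linalg.norm(w)
    for _ in range(epochs):
        for i in rng.permutation(P):
            margin = y[i] * (X[i] @ w)
            # gradient step on log(1 + exp(-margin))
            w += lr * y[i] * X[i] / (1.0 + np.exp(margin))
        target_norm *= norm_decay                    # slow norm reduction
        w *= target_norm / np.linalg.norm(w)
    return w

def flatness_profile(w, radii=(0.02, 0.05, 0.1, 0.2), samples=200):
    """Average training error after random perturbations of relative radius r:
    a wide flat minimum keeps the error low out to comparatively large r."""
    scale = np.linalg.norm(w)
    profile = {}
    for r in radii:
        errs = []
        for _ in range(samples):
            d = rng.normal(size=N)
            d *= r * scale / np.linalg.norm(d)
            errs.append(train_error(w + d))
        profile[r] = float(np.mean(errs))
    return profile

w = sgd_cross_entropy()
print("train error:", train_error(w))
print("perturbed error by relative radius:", flatness_profile(w))
```

Under these assumptions, a minimizer lying in a wide flat region would show a perturbed-error profile that stays close to the unperturbed training error out to relatively large perturbation radii, whereas a narrow minimum degrades quickly.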
format Online
Article
Text
id pubmed-6955380
institution National Center for Biotechnology Information
language English
publishDate 2020
publisher National Academy of Sciences
record_format MEDLINE/PubMed
spelling pubmed-6955380 2020-01-14 Shaping the learning landscape in neural networks around wide flat minima Baldassi, Carlo Pittorino, Fabrizio Zecchina, Riccardo Proc Natl Acad Sci U S A PNAS Plus Learning in deep neural networks takes place by minimizing a nonconvex high-dimensional loss function, typically by a stochastic gradient descent (SGD) strategy. The learning process is observed to be able to find good minimizers without getting stuck in local critical points and such minimizers are often satisfactory at avoiding overfitting. How these 2 features can be kept under control in nonlinear devices composed of millions of tunable connections is a profound and far-reaching open question. In this paper we study basic nonconvex 1- and 2-layer neural network models that learn random patterns and derive a number of basic geometrical and algorithmic features which suggest some answers. We first show that the error loss function presents few extremely wide flat minima (WFM) which coexist with narrower minima and critical points. We then show that the minimizers of the cross-entropy loss function overlap with the WFM of the error loss. We also show examples of learning devices for which WFM do not exist. From the algorithmic perspective we derive entropy-driven greedy and message-passing algorithms that focus their search on wide flat regions of minimizers. In the case of SGD and cross-entropy loss, we show that a slow reduction of the norm of the weights along the learning process also leads to WFM. We corroborate the results by a numerical study of the correlations between the volumes of the minimizers, their Hessian, and their generalization performance on real data. National Academy of Sciences 2020-01-07 2019-12-23 /pmc/articles/PMC6955380/ /pubmed/31871189 http://dx.doi.org/10.1073/pnas.1908636117 Text en Copyright © 2020 the Author(s). Published by PNAS. This open access article is distributed under the Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND) (https://creativecommons.org/licenses/by-nc-nd/4.0/).
spellingShingle PNAS Plus
Baldassi, Carlo
Pittorino, Fabrizio
Zecchina, Riccardo
Shaping the learning landscape in neural networks around wide flat minima
title Shaping the learning landscape in neural networks around wide flat minima
title_full Shaping the learning landscape in neural networks around wide flat minima
title_fullStr Shaping the learning landscape in neural networks around wide flat minima
title_full_unstemmed Shaping the learning landscape in neural networks around wide flat minima
title_short Shaping the learning landscape in neural networks around wide flat minima
title_sort shaping the learning landscape in neural networks around wide flat minima
topic PNAS Plus
url https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6955380/
https://www.ncbi.nlm.nih.gov/pubmed/31871189
http://dx.doi.org/10.1073/pnas.1908636117
work_keys_str_mv AT baldassicarlo shapingthelearninglandscapeinneuralnetworksaroundwideflatminima
AT pittorinofabrizio shapingthelearninglandscapeinneuralnetworksaroundwideflatminima
AT zecchinariccardo shapingthelearninglandscapeinneuralnetworksaroundwideflatminima