
On Scalable Deep Learning and Parallelizing Gradient Descent

Speeding up gradient-based methods has been a subject of interest in recent years, with many practical applications, especially in Deep Learning. Although many optimizations have been made at the hardware level, the convergence rate of very large models remains problematic...


Bibliographic Details
Main Author: Hermans, Joeri
Language: eng
Published: 2017
Subjects: Computing and Computers
Online Access: http://cds.cern.ch/record/2276711
_version_ 1780955206391955456
author Hermans, Joeri
author_facet Hermans, Joeri
author_sort Hermans, Joeri
collection CERN
description Speeding up gradient-based methods has been a subject of interest in recent years, with many practical applications, especially in Deep Learning. Although many optimizations have been made at the hardware level, the convergence rate of very large models remains problematic. Therefore, data-parallel methods, alongside mini-batch parallelism, have been suggested to further reduce the training time of parameterized models trained with gradient-based methods. Nevertheless, asynchronous optimization was considered too unstable for practical purposes due to a lack of understanding of the underlying mechanisms. Recently, a theoretical contribution has been made that describes asynchronous optimization in terms of (implicit) momentum arising from a queuing model of gradients computed with respect to past parameterizations. This thesis builds mainly on that work to develop a better understanding of why asynchronous optimization shows proportionally more divergent behavior as the number of parallel workers increases, and how this affects existing distributed optimization algorithms. Furthermore, using our redefinition of parameter staleness, we construct two novel techniques for asynchronous optimization, AGN and ADAG. This work shows that these methods outperform existing approaches and, in contrast to existing distributed optimization algorithms such as DOWNPOUR, (A)EASGD, and DynSGD, are more robust to (distributed) hyperparameterization. Additionally, this thesis presents several smaller contributions. First, we show that the convergence rate of EASGD-derived algorithms is impaired by an equilibrium condition; however, this equilibrium condition also prevents EASGD from overfitting quickly. Finally, we introduce a new metric, temporal efficiency, to compare distributed optimization algorithms against each other.
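For intuition, the following is a minimal toy sketch of the asynchronous, DOWNPOUR-style setting the abstract refers to: several workers compute gradients against stale copies of the parameters, and a parameter server applies whichever gradient arrives next. This is not code from the thesis, and the names (grad, async_sgd), the scalar quadratic objective, and the uniform-random worker scheduling are illustrative assumptions only; the point is that the average staleness grows with the number of workers, which is what the implicit-momentum view formalizes.

import random

def grad(theta):
    # Gradient of the toy objective f(theta) = 0.5 * theta ** 2.
    return theta

def async_sgd(num_workers=8, steps=200, lr=0.1, seed=0):
    # Toy simulation of DOWNPOUR-style asynchronous SGD (illustrative only).
    # Each worker holds a possibly stale copy of the parameters; the
    # parameter server applies whichever gradient arrives next.
    rng = random.Random(seed)
    theta = 1.0
    worker_copies = [theta] * num_workers   # stale parameter copies
    for _ in range(steps):
        w = rng.randrange(num_workers)      # a random worker finishes first
        g = grad(worker_copies[w])          # gradient w.r.t. its stale copy
        theta -= lr * g                     # server applies the stale gradient
        worker_copies[w] = theta            # worker pulls fresh parameters
    return theta

if __name__ == "__main__":
    # More workers -> more staleness -> behavior resembling added momentum.
    for n in (1, 4, 16, 64):
        print(n, async_sgd(num_workers=n))

In the implicit-momentum analysis the abstract cites, the expected effect of this queue of stale gradients behaves like a momentum term that grows with the number of workers, which is why divergent behavior becomes more pronounced as workers are added.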
id cern-2276711
institution European Organization for Nuclear Research
language eng
publishDate 2017
record_format invenio
spelling cern-2276711 2019-09-30T06:29:59Z http://cds.cern.ch/record/2276711 eng Hermans, Joeri On Scalable Deep Learning and Parallelizing Gradient Descent Computing and Computers CERN-THESIS-2017-103 oai:cds.cern.ch:2276711 2017-08-02T14:58:26Z
spellingShingle Computing and Computers
Hermans, Joeri
On Scalable Deep Learning and Parallelizing Gradient Descent
title On Scalable Deep Learning and Parallelizing Gradient Descent
title_full On Scalable Deep Learning and Parallelizing Gradient Descent
title_fullStr On Scalable Deep Learning and Parallelizing Gradient Descent
title_full_unstemmed On Scalable Deep Learning and Parallelizing Gradient Descent
title_short On Scalable Deep Learning and Parallelizing Gradient Descent
title_sort on scalable deep learning and parallelizing gradient descent
topic Computing and Computers
url http://cds.cern.ch/record/2276711
work_keys_str_mv AT hermansjoeri onscalabledeeplearningandparallelizinggradientdescent