
On Scalable Deep Learning and Parallelizing Gradient Descent

Speeding up gradient-based methods has been a subject of interest in recent years, with many practical applications, especially in Deep Learning. Although many optimizations have been made at the hardware level, the convergence rate of very large models remains problematic...


Bibliographic Details
Main Author: Hermans, Joeri
Language: eng
Published: 2017
Subjects: Computing and Computers
Online Access: http://cds.cern.ch/record/2276711
_version_ 1780955206391955456
author Hermans, Joeri
author_facet Hermans, Joeri
author_sort Hermans, Joeri
collection CERN
description Speeding up gradient-based methods has been a subject of interest in recent years, with many practical applications, especially in Deep Learning. Although many optimizations have been made at the hardware level, the convergence rate of very large models remains problematic. Therefore, data-parallel methods, alongside mini-batch parallelism, have been suggested to further reduce the training time of parameterized models trained with gradient-based methods. Nevertheless, asynchronous optimization was considered too unstable for practical purposes due to a lack of understanding of the underlying mechanisms. Recently, a theoretical contribution has been made that describes asynchronous optimization in terms of (implicit) momentum arising from a queuing model of gradients computed with respect to past parameterizations. This thesis builds mainly on that work to develop a better understanding of why asynchronous optimization shows proportionally more divergent behavior as the number of parallel workers increases, and how this affects existing distributed optimization algorithms. Furthermore, using our redefinition of parameter staleness, we construct two novel techniques for asynchronous optimization, AGN and ADAG. This work shows that these methods outperform existing approaches and, in contrast to existing distributed optimization algorithms such as DOWNPOUR, (A)EASGD, and DynSGD, are more robust to (distributed) hyperparameterization. Additionally, this thesis presents several smaller contributions. First, we show that the convergence rate of EASGD-derived algorithms is impaired by an equilibrium condition; however, this equilibrium condition also prevents EASGD from overfitting quickly. Finally, we introduce a new metric, temporal efficiency, to compare distributed optimization algorithms against each other.
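For intuition, the following is a minimal toy sketch of the asynchronous, DOWNPOUR-style setting the abstract refers to: several workers compute gradients against stale copies of the parameters, and a parameter server applies whichever gradient arrives next. This is not code from the thesis, and the names (grad, async_sgd), the scalar quadratic objective, and the uniform-random worker scheduling are illustrative assumptions only; the point is that the average staleness grows with the number of workers, which is what the implicit-momentum view formalizes.

import random

def grad(theta):
    # Gradient of the toy objective f(theta) = 0.5 * theta ** 2.
    return theta

def async_sgd(num_workers=8, steps=200, lr=0.1, seed=0):
    # Toy simulation of DOWNPOUR-style asynchronous SGD (illustrative only).
    # Each worker holds a possibly stale copy of the parameters; the
    # parameter server applies whichever gradient arrives next.
    rng = random.Random(seed)
    theta = 1.0
    worker_copies = [theta] * num_workers   # stale parameter copies
    for _ in range(steps):
        w = rng.randrange(num_workers)      # a random worker finishes first
        g = grad(worker_copies[w])          # gradient w.r.t. its stale copy
        theta -= lr * g                     # server applies the stale gradient
        worker_copies[w] = theta            # worker pulls fresh parameters
    return theta

if __name__ == "__main__":
    # More workers -> more staleness -> behavior resembling added momentum.
    for n in (1, 4, 16, 64):
        print(n, async_sgd(num_workers=n))

In the implicit-momentum analysis the abstract cites, the expected effect of this queue of stale gradients behaves like a momentum term that grows with the number of workers, which is why divergent behavior becomes more pronounced as workers are added.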
id cern-2276711
institution European Organization for Nuclear Research
language eng
publishDate 2017
record_format invenio
spelling cern-2276711 2019-09-30T06:29:59Z http://cds.cern.ch/record/2276711 eng Hermans, Joeri On Scalable Deep Learning and Parallelizing Gradient Descent Computing and Computers CERN-THESIS-2017-103 oai:cds.cern.ch:2276711 2017-08-02T14:58:26Z
spellingShingle Computing and Computers
Hermans, Joeri
On Scalable Deep Learning and Parallelizing Gradient Descent
title On Scalable Deep Learning and Parallelizing Gradient Descent
title_full On Scalable Deep Learning and Parallelizing Gradient Descent
title_fullStr On Scalable Deep Learning and Parallelizing Gradient Descent
title_full_unstemmed On Scalable Deep Learning and Parallelizing Gradient Descent
title_short On Scalable Deep Learning and Parallelizing Gradient Descent
title_sort on scalable deep learning and parallelizing gradient descent
topic Computing and Computers
url http://cds.cern.ch/record/2276711
work_keys_str_mv AT hermansjoeri onscalabledeeplearningandparallelizinggradientdescent