# Paper reading - ADADELTA: An Adaptive Learning Rate Method

This paper was written by Matthew D. Zeiler while he was an intern at Google.

### Introduction

The aim of many machine learning methods is to update a set of parameters $x$ in order to optimize an objective function $f(x)$. This often involves some iterative procedure which applies changes to the parameters, $\Delta x_t$, at each iteration of the algorithm. Denoting the parameters at the $t$-th iteration as $x_t$, this simple update rule becomes:

$$x_{t+1} = x_t + \Delta x_t, \qquad \Delta x_t = -\eta\, g_t$$

where:

- $g_t = \frac{\partial f(x_t)}{\partial x_t}$ is the gradient of the parameters at the $t$-th iteration
- $\eta$ is a learning rate which controls how large of a step to take in the direction of the negative gradient
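As a minimal sketch of this update rule (the names `sgd_step`, `grad_fn`, and `eta` are mine, not the paper's):

```python
import numpy as np

def sgd_step(x, grad_fn, eta=0.01):
    """One iteration of the generic rule x_{t+1} = x_t + Delta x_t,
    with Delta x_t = -eta * g_t (plain gradient descent)."""
    g_t = grad_fn(x)       # gradient of f at the current parameters
    return x - eta * g_t   # step in the direction of the negative gradient

# e.g. one step on f(x) = ||x||^2, whose gradient is 2x
x_next = sgd_step(np.array([3.0, -2.0]), grad_fn=lambda x: 2 * x)
```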

### Purpose

The method presented in this paper was derived from ADAGRAD in order to improve upon its two main drawbacks:

- the continual decay of learning rates throughout training
- the need for a manually selected global learning rate

### SGD vs ADAGRAD vs ADADELTA

The per-iteration parameter update $\Delta x_t$ of each method (a code sketch of ADADELTA follows the list):

- SGD:
  $$\Delta x_t = -\eta\, g_t$$
- ADAGRAD:
  $$\Delta x_t = -\frac{\eta}{\sqrt{\sum_{\tau=1}^{t} g_\tau^2}}\, g_t$$
- ADADELTA:
  $$E[g^2]_t = \rho\, E[g^2]_{t-1} + (1-\rho)\, g_t^2$$
  $$E[\Delta x^2]_t = \rho\, E[\Delta x^2]_{t-1} + (1-\rho)\, \Delta x_t^2$$
  $$\Delta x_t = -\frac{\text{RMS}[\Delta x]_{t-1}}{\text{RMS}[g]_t}\, g_t, \qquad \text{RMS}[g]_t = \sqrt{E[g^2]_t + \epsilon}$$
  - where $\rho$ is a constant controlling the decay of the accumulated previous squared gradients and parameter updates
  - where a constant $\epsilon$ is added to better condition the denominator
  - where $E[g^2]_t$ is the running (decaying) average of the squared gradient at time $t$
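To make the ADADELTA update concrete, here is a minimal NumPy sketch of the paper's Algorithm 1 (the helper names `adadelta`, `grad_fn`, and `n_iter` are mine):

```python
import numpy as np

def adadelta(x, grad_fn, rho=0.95, eps=1e-6, n_iter=200):
    """Sketch of ADADELTA (Algorithm 1): per-dimension steps scaled by
    decaying averages of squared gradients and squared updates;
    no global learning rate is required."""
    Eg2 = np.zeros_like(x)   # E[g^2]_t: running average of squared gradients
    Edx2 = np.zeros_like(x)  # E[Delta x^2]_t: running average of squared updates
    for _ in range(n_iter):
        g = grad_fn(x)
        Eg2 = rho * Eg2 + (1 - rho) * g ** 2
        # Delta x_t = -(RMS[Delta x]_{t-1} / RMS[g]_t) * g_t
        dx = -(np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps)) * g
        Edx2 = rho * Edx2 + (1 - rho) * dx ** 2
        x = x + dx
    return x

# e.g. minimize f(x) = ||x||^2 from an arbitrary starting point
x_min = adadelta(np.array([3.0, -2.0]), grad_fn=lambda x: 2 * x)
```

Note that the numerator uses $\text{RMS}[\Delta x]_{t-1}$, the accumulator from the *previous* step, because $\Delta x_t$ is not yet known when the current step is computed.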

### Result

Compared with SGD, ADAGRAD, and MOMENTUM, ADADELTA typically converges faster and reaches a lower error rate.

### Personal Thought

I have tried both ADADELTA and SGD. Although each epoch of ADADELTA takes longer to compute, it learns very well with just the default values $\rho = 0.95$ and $\epsilon = 10^{-6}$. With SGD, the learning rate has to be fine-tuned by hand, and the resulting error rate is often higher than with ADADELTA.
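In practice this means very little setup. As a hedged example, assuming a PyTorch workflow (the model below is a hypothetical placeholder), the defaults above map directly onto the built-in `torch.optim.Adadelta`:

```python
import torch

model = torch.nn.Linear(10, 1)  # hypothetical stand-in model
# rho and eps set to the paper's suggested defaults
optimizer = torch.optim.Adadelta(model.parameters(), rho=0.95, eps=1e-6)

x, y = torch.randn(4, 10), torch.randn(4, 1)
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()
optimizer.step()       # ADADELTA update; no learning rate to tune
optimizer.zero_grad()
```

PyTorch's implementation also exposes an `lr` scaling factor (default `1.0`); leaving it at `1.0` matches the paper's formulation.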