Regularization and Dropout

Regularization and Dropout
Nov 21, 2016

(WIP)

Useful Resources

My Presentation Slides

Dropout as Regularization

Introduction

One of the important challenges in the use of neural networks is generalization. Since we have a huge hypothesis space in neural networks, maximum likelihood estimation of parameters almost always suffers over-fitting. The most popular workaround to this problem is dropout ¹. Though it is clear that it causes the network to fit less to the training data, it is not clear at all what is the mechanism behind the dropout method and how it is linked to our classical methods, such as L-2 norm regularization and Lasso. With regards to this theoretical issue, Wager et al. ² present a new view on dropout when applied to Generalized Linear Models. They view dropout as a process that artificially corrupts data with multiplicative Bernoullie noise. They prove that dropout when applied to Generalized Linear Models is nothing but adding adaptive L-2 norm regularization term (penalty term) to the negative log likelihood function up to the quadratic approximation. It turns out that the L-2 norm penalty term works in a way that favors rare but discriminative features.

Background

Generalized Linear Models and Exponential Family Distributions

The exponential family distribution is defined as the following:

$p(y ; \theta) = h(y)exp(\theta^T T(y) - A(\theta))$

And the Generalized Linear Model assumption is to set $\theta= \beta^T x$ where $x$ is an explanatory variable. $\theta$ is called the canonical parameter. The function $A$ is a normalizer, and

$\nabla A (\theta) = E[T(X)]\\ \nabla^2 A(\theta) = Cov[T(X)]$

You can prove this, and this will come in handy later. For those of you who are interested in more details, see Andrew Ng’s Notes. Assuming that we have independent samples in the training set, the negative log likelihood is

$l(\beta) = - \sum\limits_i log p(x^{(i)} \vert y^{(i)} ; \beta)$

and

$\DeclareMathOperator*{\argmin}{argmin} \DeclareMathOperator*{\argmax}{argmax} \hat{\beta}_{MLE} = \argmin\limits_\beta l(\beta)$

Robustness and Constrained Optimization

It is a terrible idea to fit a nth-degree polynomial to n data points. It does a perfect job on the given data points, but it fails to generalize to new data points. over-fitting. Putting it in a more statistical language, the parameter estimation varies too much according to a particular set of realizations from our true distribution. This insight encourages us to constrain possible values that the parameter can take on so that the parameter estimate would not vary so much based on the realizations we get to observe. L-1 and L-2 norm constraints on the estimated parameter are both common choices for the constraint. This is the trade-off between bias and variance of a model. For instance, we can pose an optimization problem constrained by the L-2 norm regularization instead of the normal linear regression problem.

$\min\limits_\beta \lVert y - X\beta \rVert_2^2 \\ \text{ such that } \lVert \beta \rVert_2^2 \leq s$

However, by the strong duality of the convex problem, we can solve the dual problem:

$\min\limits_\beta \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2$ where there is a one-to-one relationship between $\lambda$ and $s$.

Adaptive Regularization and Dropout

The vanilla regularization scheme, such as Lasso and Ridge Regression, penalizes big parameters uniformly. Namely, our constraint is solely based on the parameter. However, sometimes, it might be useful to prefer some features than the other. For instance, when recognizing a handwritten digit in the MNIST dataset, we might want to look at rare features that are specific to each number. In such cases, adaptive regularization comes into play. Wager et al.² proves that dropout works in a way that adaptive regularization happens. First, let’s think of the dropout training as a process artificially corrupting the data with noise.

$\widetilde x^{(i)} = x^{(i)} \odot \xi_i \quad where \quad \xi_i \sim \frac{Bern(1-\delta)}{1-\delta}$

Then, the negative log likelihood (the cost function) becomes:

$l(\beta) = \sum_{i=1}^m - log h(y^{(i)}) + A(\beta^T x^{(i)}) - (\beta^T x^{(i)} ) y^{(i)} \\ E \widetilde l(\beta) = E\left[\sum_{i=1}^m - log h(y^{(i)}) + A(\beta^T \widetilde x^{(i)}) - (\beta^T \widetilde x^{(i)} ) y^{(i)}\right]\\ = l(\beta) + \sum_{i=1}^m E[A(\beta^T \widetilde x^{(i)} ] - A(\beta^T x(i))\\ = l(\beta) + R(\beta)$

We can approximate the R penalty term up to the quadratic Taylor expansion.

$R(\beta) = \sum_{i=1}^m E[A(\beta^T \widetilde x^{(i)})] - A(\beta^T x^{(i)}) \\ \approx \sum_{i=1}^m A^{\prime} (\beta^T x^{(i)}) \beta^T E(\widetilde x^{(i)}- x^{(i)}) + \frac{1}{2} A^{''}(\beta^T x^{(i)}) Var(\beta^T \widetilde x^{(i)}) \\ = \sum_{i=1}^m \frac{1}{2} A^{''}(\beta^T x^{(i)}) Var(\beta^T \widetilde x^{(i)}) \doteq R^q(\beta)$

Now recall that $A^{‘’} = Cov(T(X))$ So for the logistic regression (Bernoulli distribution),

$R^q(\beta) = \sum_{i=1}^m \frac{1}{2} Var[\beta^T \widetilde x^{(i)}]\sigma(\beta^T x^{(i)})(1-\sigma(\beta^T x^{(i)})) \\ = \sum_{i=1}^m \frac{1}{2} Var[\beta^T \widetilde x^{(i)}]p^{(i)} (1-p^{(i)})$

where $\sigma$ denotes the sigmoid function.

Also,

$Var(\xi) = \frac{\delta (1-\delta)}{(1-\delta)^2} I = \frac{\delta}{1-\delta}I\\ Var(\widetilde x^{(i)}) = Var[x^{(i)} \odot \xi] \\ = \frac{\delta}{1-\delta} diag\left(x_j^{(i)^2}\right)Var(\beta^T \widetilde x^{(i)}) \\ = \beta^T Var(\widetilde x^{(i)}) \beta\\ = \frac{\delta}{1-\delta} \sum_{j=1}^d x_j^{(i)^2} \beta_j^2$

So intuitively, the penalty term penalizes the model when the model is not confident and the corresponding feature fires often. Namely, the dropout training favors rare but discriminative features.

References

https://arxiv.org/pdf/1207.0580.pdf ↩
http://papers.nips.cc/paper/4882-dropout-training-as-adaptive-regularization.pdf ↩ ↩²

neural networks, learning theory

Home

Yale Data Science