Regularization is a vital tool in machine learning for preventing overfitting and improving generalization. This chapter introduces the concept of regularization and discusses common regularization techniques in more depth.
Even linear regression can overfit if the number of features $p$ is large relative to the number of observations $n$. Moreover, the OLS estimator requires a full-rank design matrix, and with highly correlated features OLS becomes sensitive to random errors, resulting in a large variance of the estimates.
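This sensitivity can be seen directly; here is a minimal sketch in NumPy (the nearly collinear features and the noise level are arbitrary illustrative choices, not part of the original notes):

```python
import numpy as np

# Two almost perfectly correlated features make the OLS solution
# (X^T X)^{-1} X^T y highly sensitive to the particular noise draw.
rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
x2 = x1 + 1e-3 * rng.normal(size=n)    # nearly collinear copy of x1
X = np.column_stack([x1, x2])
theta_true = np.array([1.0, 1.0])

for _ in range(3):
    y = X @ theta_true + 0.1 * rng.normal(size=n)
    theta_ols = np.linalg.solve(X.T @ X, X.T @ y)
    print(theta_ols)  # individual coefficients swing wildly across noise draws
```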
Ridge regression addresses these problems by adding an $L_2$ penalty on $\theta$ to the least squares objective:
\[\begin{align*} \hat{\theta}_{\text{ridge}} &= \underset{\theta}{\operatorname{argmin}} \sum_{i=1}^n (y^{(i)} - \theta^T x^{(i)})^2 + \lambda \sum_{j=1}^p \theta_j^2 \\ &= \underset{\theta}{\operatorname{argmin}} \| y - X\theta \|_2^2 + \lambda \| \theta \|_2^2 \end{align*}\]The ridge estimator has the closed-form solution
\[\hat{\theta}_{\text{ridge}} = (X^TX + \lambda I)^{-1}X^Ty\]Note that $X^TX + \lambda I$ is invertible for any $\lambda > 0$, even if $X$ does not have full rank.
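A minimal NumPy sketch of the closed-form ridge solution above (the toy data and the values of $\lambda$ are assumptions made for illustration):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Ridge estimator (X^T X + lam * I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
theta_true = np.array([2.0, -1.0, 0.0, 0.5, 0.0])
y = X @ theta_true + 0.1 * rng.normal(size=100)

print(ridge_closed_form(X, y, lam=0.0))   # lam = 0 recovers the OLS solution
print(ridge_closed_form(X, y, lam=10.0))  # larger lam shrinks the coefficients towards zero
```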
Another shrinkage method is the so-called LASSO (least absolute shrinkage and selection operator), which uses the $L_1$ penalty on $\theta$:
\[\begin{align*} \hat{\theta}_{\text{lasso}} &= \underset{\theta}{\operatorname{argmin}} \sum_{i=1}^n (y^{(i)} - \theta^T x^{(i)})^2 + \lambda \sum_{j=1}^p |\theta_j| \\ &= \underset{\theta}{\operatorname{argmin}} \| y - X\theta \|_2^2 + \lambda \| \theta \|_1 \end{align*}\]Suppose two variables are highly correlated. Ridge regression then tends to show a "grouping effect", i.e. it assigns similar coefficients to both variables, whereas LASSO arbitrarily selects one of them. If $x_i = x_j$, then $\hat{\theta}_{i,\text{ridge}} = \hat{\theta}_{j,\text{ridge}}$ because the $L_2$ penalty is strictly convex. Since the $L_1$ penalty is not strictly convex, the sum of the coefficients can be allocated arbitrarily across both features, i.e.
\[\hat{\theta}_i = s \cdot (\hat{\theta}_i + \hat{\theta}_j) \text{ and } \hat{\theta}_j = (1-s) \cdot (\hat{\theta}_i + \hat{\theta}_j) \text{ for } s \in [0,1]\]and any such split is an equally valid LASSO solution.
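The grouping effect can be checked empirically; the following sketch uses scikit-learn's Ridge and Lasso (the library choice, toy data and penalty strengths are assumptions for illustration; note that scikit-learn calls its penalty strength `alpha`):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

# Two identical copies of the same feature plus one independent feature.
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
X = np.column_stack([x, x, rng.normal(size=n)])
y = 3.0 * x + 0.1 * rng.normal(size=n)

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print(ridge.coef_)  # the weight is split (roughly) evenly across the two copies
print(lasso.coef_)  # typically only one of the two copies receives a nonzero weight
```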
The elastic net is a "compromise" between the $L_1$ and $L_2$ penalties:
\[\begin{align*} \hat{\theta}_{\text{elnet}} &= \underset{\theta}{\operatorname{argmin}} \sum_{i=1}^n (y^{(i)} - \theta^T x^{(i)})^2 + \lambda_1 \| \theta \|_2^2 + \lambda_2 \| \theta \|_1 \\ &= \underset{\theta}{\operatorname{argmin}} \| y - X\theta \|_2^2 + \lambda \alpha \| \theta \|_2^2 + \lambda (1 - \alpha) \| \theta \|_1 \end{align*}\]
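A corresponding sketch with scikit-learn's ElasticNet (its `alpha` / `l1_ratio` parametrisation does not map one-to-one onto the $\lambda$ and $\alpha$ above; the values here are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Same setup as before: two identical copies of one feature plus an independent one.
rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)
X = np.column_stack([x, x, rng.normal(size=n)])
y = 3.0 * x + 0.1 * rng.normal(size=n)

# l1_ratio mixes the penalties: 1.0 is pure lasso, 0.0 is pure ridge.
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(enet.coef_)  # sparse like lasso, but correlated copies tend to share their weight
```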
We can summarise regularised risk minimisation (RRM) using the following equation:
\[\min_\theta R_{reg}(\theta) = \min_\theta \left( \sum_{i=1}^n L(y^{(i)}, f(x^{(i)} \mid \theta)) + \lambda J(\theta) \right)\]We have previously established an equivalence between MLE and ERM. We can do the same for MAP and RRM. Assume we have a parametrised distribution $p(y \mid \theta, x)$ for our data and a prior distribution $q(\theta)$ over our parameter space. Then, by Bayes' rule:
\[p(\theta | x, y) = \frac{p(y|\theta,x) q(\theta)}{p(y|x)} \propto p(y|\theta, x) q(\theta)\]The maximum a posteriori (MAP) estimator is then given by:
\[\begin{align*} \hat{\theta}_{\text{MAP}} &= \underset{\theta}{\operatorname{argmax}}\, p(y \mid \theta, x)\, q(\theta) \\ &= \underset{\theta}{\operatorname{argmax}}\, \log p(y \mid \theta, x) + \log q(\theta) \\ &= \underset{\theta}{\operatorname{argmin}}\, -\log p(y \mid \theta, x) - \log q(\theta) \end{align*}\]
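To make the connection to RRM concrete, assume a Gaussian prior over the parameters (a standard choice, with the prior precision playing the role of $\lambda$; any noise variance in the likelihood can be absorbed into $\lambda$):
\[\begin{align*} q(\theta) &\propto \exp\left(-\frac{\lambda}{2} \|\theta\|_2^2\right) \\ -\log q(\theta) &= \frac{\lambda}{2} \|\theta\|_2^2 + \text{const} \\ \hat{\theta}_{\text{MAP}} &= \underset{\theta}{\operatorname{argmin}}\, -\log p(y \mid \theta, x) + \frac{\lambda}{2} \|\theta\|_2^2 \end{align*}\]Hence MAP estimation with a Gaussian prior corresponds to RRM with an $L_2$ penalty; an analogous argument with a Laplace prior yields the $L_1$ penalty.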
Now consider $L_2$ regularisation from the optimisation perspective:
\[\min_\theta R_{reg}(\theta) = \min_\theta \sum_{i=1}^n L(y^{(i)}, f(x^{(i)} \mid \theta)) + \frac{\lambda}{2} ||\theta||_2^2 = \min_\theta R_{emp}(\theta) + \frac{\lambda}{2} ||\theta||_2^2\]If we now update $\theta$ using gradient descent, then the gradient is:
\[\nabla_\theta R_{reg}(\theta) = \nabla_\theta R_{emp}(\theta) + \lambda \theta\]Thus we get the following update:
\[\begin{align*} \theta^{new} &= \theta^{old} - \alpha(\nabla_\theta R_{emp}(\theta^{old}) + \lambda \theta^{old}) \\ &= \theta^{old} (1 - \alpha \lambda) - \alpha \nabla_\theta R_{emp}(\theta^{old}) \end{align*}\]Usually $\alpha \lambda \ll 1$, so in effect the old parameter vector is first shrunk ("decayed") in magnitude and then the usual gradient step is applied. The well-known technique of weight decay is therefore simply $L_2$ regularisation in disguise. Note that this equivalence only holds for (stochastic) gradient descent and not for adaptive optimisers such as Adam.
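A minimal NumPy sketch (linear model with mean squared error; the data, learning rate $\alpha$ and regularisation strength $\lambda$ are arbitrary illustrative choices) confirming that adding $\lambda\theta$ to the gradient and decaying the weights by $(1 - \alpha\lambda)$ yield identical updates:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

alpha, lam = 0.01, 0.1        # learning rate and regularisation strength
theta_a = np.zeros(3)         # gradient descent on the L2-regularised risk
theta_b = np.zeros(3)         # plain gradient descent with explicit weight decay

for _ in range(1000):
    # gradient of the (mean) squared-error empirical risk
    grad_a = 2 * X.T @ (X @ theta_a - y) / len(y)
    theta_a = theta_a - alpha * (grad_a + lam * theta_a)    # regularised gradient step

    grad_b = 2 * X.T @ (X @ theta_b - y) / len(y)
    theta_b = (1 - alpha * lam) * theta_b - alpha * grad_b  # weight decay step

print(np.allclose(theta_a, theta_b))  # True (up to floating-point error)
```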