Regularization in the Deep Learning World

Vidhya Chandrasekaran
7 min read · Feb 17, 2019

The key objective of a machine learning or deep learning model is to train in such a way that its predictions are as close as possible to the ground truth. Optimization techniques help bring these two values ( Y and Yhat ) closer together. It is not uncommon, however, for this objective to be achieved with great success in the training phase while performance drops drastically in the test phase. Formally, when Yhat ( the predicted value ) and Y ( the ground truth ) are not close and the model struggles to get good results even in the training phase, we say the model has a “Bias” issue, i.e. it is underfitting. When the training phase shows good performance but performance falls drastically in the test and validation phases, we say the model has a “Variance” issue, i.e. it is overfitting. In the former scenario, data scientists attempt to correct the problem with various optimization techniques such as gradient descent, RMSprop, and Adam. The latter is treated with various regularization techniques.

An underfit, overfit and perfect fit model

Overfitting can result from insufficient data or from under-sampling of the data. In such scenarios, the model effectively memorizes the training set and fits it perfectly. For any new data, this memorization fails and the predictions deviate from the ground truth. In the deep learning era, it is still very much relevant and important to regularize a model so that it does not fall flat in a facial recognition or speech recognition application, or even to eliminate various social biases ( not the underfitting kind of bias ) from the models.

There are several ways regularization is applied to deep learning models; here are a few:

Regularization with Penalty

Penalty-based regularization has been a very popular way to regularize for many years now, even before the emergence of deep learning models. The idea is to add a penalty term, scaled by a coefficient Lambda, to the cost function. Lambda is a hyperparameter to tune; when Lambda is zero, the model is essentially unregularized.

i) L2 Regularization

L2 regularization, when applied to linear models with a least-squares cost function, is also called Ridge regression. It adds the squared values of the coefficients to the cost function, which penalizes the weights whenever they deviate from the origin and thus avoids giving overly high importance to any one variable.

L2 penalty term — highlighted term
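As a minimal NumPy sketch of the idea (function names and the mean-squared-error base cost are my own illustrative choices, not from the article), the penalty simply adds Lambda times the squared weights to the cost, and the gradient gains a matching term that pulls the weights toward the origin:

```python
import numpy as np

def l2_cost(X, y, w, lam):
    """Mean-squared-error cost with an L2 (Ridge) penalty on the weights."""
    residual = X @ w - y
    mse = np.mean(residual ** 2)
    penalty = lam * np.sum(w ** 2)   # Lambda * ||w||^2
    return mse + penalty

def l2_grad(X, y, w, lam):
    """Gradient of the penalized cost: the extra 2*lam*w term
    shrinks the weights toward zero at every update."""
    n = len(y)
    return (2.0 / n) * X.T @ (X @ w - y) + 2.0 * lam * w
```

With `lam = 0` this reduces to the unregularized cost, matching the note above that Lambda = 0 means an unregularized model.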

ii) L1 Regularization

Although not used as frequently as L2 regularization, L1 is also a key regularization technique, used especially as a feature selection method. While the L2 penalty term uses squared coefficient values, L1 uses absolute values. When used in linear models, this technique is called Lasso. It penalizes in a way that produces a sparse model, and hence can serve as a feature selection method. Though some argue that the resulting sparsity could make deep learning models faster due to having fewer variables, this is typically not the case in practice ( per Andrew Ng in the deeplearning.ai courses on Coursera ).
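A brief sketch of why L1 produces sparsity (again in NumPy, with illustrative names): the penalty adds Lambda times the absolute weights, and the corresponding proximal update, soft-thresholding, sets small weights exactly to zero rather than merely shrinking them:

```python
import numpy as np

def l1_cost(X, y, w, lam):
    """Mean-squared-error cost with an L1 (Lasso) penalty: lam * sum(|w|)."""
    return np.mean((X @ w - y) ** 2) + lam * np.sum(np.abs(w))

def soft_threshold(w, t):
    """Proximal operator of the L1 norm: shrinks each weight toward 0
    by t, and sets weights smaller than t exactly to zero --
    this is where the sparsity (feature selection) comes from."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)
```

Applying `soft_threshold` to `[0.05, -0.5, 2.0]` with `t = 0.1` zeroes out the first weight entirely, effectively dropping that feature.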

Regularization with Noise

i) Noise in Inputs( Data Augmentation )

Since the overfitting problem is often an insufficient-data problem, adding more data can help the model generalize better. This is especially true in deep learning models, whose performance tends to scale with the volume of training data. In practical applications, however, getting more data is sometimes very costly or simply not possible. For example, consider a facial recognition system: the data available for that specific task may be limited to what already exists. What can be done in such scenarios? We can apply techniques like cropping, stretching, zooming, and changing angles, thereby generating several image examples from one example. For instance, the cat images below were generated from a single cat image. One must take care in doing so, however: we should refrain from transforming the data in a way that completely changes its interpretation. Rotating an image of a 6 into a 9, for example, changes the whole meaning of the label.

Data augmentation

For more such techniques, please refer https://towardsdatascience.com/image-augmentation-for-deep-learning-histogram-equalization-a71387f609b2
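A toy NumPy sketch of the idea (the specific transforms chosen here are illustrative): each label-preserving transform yields a new training example from the same source image, while label-sensitive transforms like large rotations are deliberately left out:

```python
import numpy as np

def augment(image):
    """Generate several variants of one image: a mirror flip and
    small shifts. The label stays the same, so transforms that could
    change the meaning (e.g. rotating a 6 into a 9) are avoided."""
    return [
        image,
        np.fliplr(image),                 # horizontal mirror
        np.roll(image, shift=1, axis=0),  # shift down one pixel
        np.roll(image, shift=1, axis=1),  # shift right one pixel
    ]
```

One input image thus becomes four training examples at essentially no data-collection cost.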

ii) Noise in Outputs( Label Smoothing )

This technique can be used when our confidence in the correctness of the labels is low. It explicitly adds noise to the output, i.e. the target variable, and is called label smoothing. In many cases we are not sure whether the true labels are actually right, and if they are wrong, maximizing the log probability under that assumption could be dangerous. Using this method, one can lower the target from 1 to 0.8 or 0.9, according to our estimate of how often the labels are incorrect.
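A minimal sketch of the standard formulation (the function name and epsilon value are mine): the hard one-hot target is mixed with a uniform distribution over the classes, so the true class drops below 1 and every other class rises slightly above 0:

```python
import numpy as np

def smooth_labels(one_hot, eps=0.1):
    """Label smoothing: mix the one-hot target with a uniform
    distribution. The true class gets (1 - eps) + eps/K and every
    class receives eps/K, where K is the number of classes."""
    k = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + eps / k
```

The smoothed targets still sum to 1, so they remain a valid probability distribution for a cross-entropy loss.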

iii) Noise in Hidden nodes ( Dropout )

Dropout is one of the most popular and widely used regularization approaches in deep learning. It knocks out random units in the layers, eliminating the incoming and outgoing connections of each dropped unit. This creates an element of randomization and prevents the model from relying too heavily on one or two variables. Since any unit might be unavailable in the next iteration, the model is pushed to generalize rather than overfit to the available data. We can assign the probability with which units are kept at each layer, and since the layers can differ in size, each layer can be given a different probability.

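A sketch of the common "inverted dropout" variant in NumPy (an assumption on my part, as the article does not specify the variant): each unit is kept with probability `keep_prob`, and the survivors are rescaled so the expected activation is unchanged, which lets the full network run untouched at test time:

```python
import numpy as np

def dropout(activations, keep_prob, rng, training=True):
    """Inverted dropout: knock out each unit with probability
    1 - keep_prob and rescale the survivors by 1 / keep_prob so the
    expected value of the activations stays the same."""
    if not training:
        return activations          # no-op at test time
    mask = rng.random(activations.shape) < keep_prob
    return activations * mask / keep_prob
```

Because `keep_prob` is an argument, a different probability can be chosen per layer, as noted above.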

Semi-supervised Techniques for Regularization

This works by learning a representation of the data, mapping a function h = f(x). Both the labeled and the unlabeled data are used for the estimation of y from x: a generative model of p(x) or p(x, y) can share parameters with the discriminative model of p(y|x).

Multi-task Learning

As we have seen, additional training data helps a model generalize; likewise, sharing a model across related tasks creates pressure on it to generalize and produce more correct outputs. In the diagram below, the initial layers have shared parameters and the upper layers have task-specific parameters. The upper layers benefit only from the examples of their specific task, while the initial layers benefit from the pooled data of all the tasks.

“Among the factors that explain the variations observed in the data associated with the different tasks, some are shared across two or more tasks.”
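The shared-trunk architecture described above can be sketched as a tiny forward pass in NumPy (the layer sizes, two tasks, and ReLU activation are illustrative assumptions): one weight matrix is trained on the pooled data of all tasks, while each task keeps its own output head.

```python
import numpy as np

rng = np.random.default_rng(42)

# Shared lower layer: its parameters see the pooled data of all tasks.
W_shared = rng.normal(size=(4, 8))

# Task-specific upper layers: each sees only its own task's examples.
W_task_a = rng.normal(size=(8, 3))  # e.g. a 3-class task
W_task_b = rng.normal(size=(8, 1))  # e.g. a regression task

def forward(x, W_head):
    hidden = np.maximum(x @ W_shared, 0.0)  # shared representation (ReLU)
    return hidden @ W_head                  # task-specific output

x = rng.normal(size=(2, 4))
out_a = forward(x, W_task_a)  # shape (2, 3)
out_b = forward(x, W_task_b)  # shape (2, 1)
```

Gradients from both tasks would flow into `W_shared`, which is where the regularizing pressure comes from.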

Early Stopping

We know that the training error decreases with every iteration when using an optimization technique like gradient descent. The validation error also decreases at first, but begins to increase after a certain point. That point is where the model performs best, and we should stop training there instead of continuing until the training error stops declining. Early stopping works by keeping the best-performing parameters so far in memory and using a “tolerance” level to decide when to terminate. If the tolerance is set to 3, then at each iteration we store the parameters whenever they beat the previous best; when they do not improve, we keep the stored parameters, and once the model has failed to improve for 3 consecutive iterations we terminate and return the best parameters from memory.
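The loop just described can be sketched in a few lines of Python (tracking validation errors rather than full parameter vectors, for brevity; the function name and signature are my own):

```python
def early_stopping(val_errors, patience=3):
    """Scan validation errors in order, remembering the best step seen.
    Stop once the error has failed to improve for `patience`
    consecutive steps, and return the best step and its error."""
    best_err = float("inf")
    best_step = 0
    waited = 0
    for step, err in enumerate(val_errors):
        if err < best_err:
            best_err, best_step, waited = err, step, 0  # new best: reset patience
        else:
            waited += 1
            if waited >= patience:
                break  # no improvement for `patience` steps: terminate
    return best_step, best_err
```

In a real training loop, a snapshot of the model parameters would be stored at each new best step and restored on termination.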

Ensemble models

Ensemble models aim at building multiple models and averaging their outputs to avoid overfitting. This is achieved either by

  1. Building multiple different models
  2. Building the same model with different variables and samples

This randomization helps achieve better generalization compared to relying on a single model.
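The averaging step can be sketched generically (treating each model as any callable, an illustrative simplification): individual models' errors tend to partially cancel when their predictions are averaged, which is the variance-reduction effect the section describes.

```python
import numpy as np

def ensemble_predict(models, X):
    """Average the predictions of several models. Uncorrelated errors
    tend to cancel, reducing variance relative to any single model."""
    preds = np.stack([model(X) for model in models])
    return preds.mean(axis=0)
```

For example, three models whose errors are +1, -1, and 0 on every input average out to the exact answer, even though none of them is individually correct.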

I hope this gives you a high-level understanding of the various regularization techniques and of which is relevant to a particular problem.

References: Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, and Andrew Ng’s deeplearning.ai Coursera courses.
