Overfitting occurs when you achieve a good fit of your model on the training data, but it does not generalize well on new, unseen data. In other words, the model learned patterns specific to the training data that are irrelevant to other data.
We can identify overfitting by watching validation metrics like loss or accuracy. Usually, the validation metric stops improving after a certain number of epochs and begins to degrade afterward: the validation loss starts to rise, or the validation accuracy starts to fall. The training metric, meanwhile, continues to improve because the model seeks to find the best fit to the training data.
There are several ways in which we can reduce overfitting in deep learning models. The best option is to get more training data. Unfortunately, in real-world situations, you often don't have this possibility due to time, budget, or technical constraints.
Another way to reduce overfitting is to lower the capacity of the model to memorize the training data. The model will then have to focus on the relevant patterns in the training data, which results in better generalization. In this post, we'll discuss three options to achieve this.
Set up the project
We start by importing the required packages and configuring some parameters. We'll use Keras to fit the deep learning models. The training data is the Twitter US Airline Sentiment data set from Kaggle.
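A minimal setup sketch. The constant names and values below are assumptions for illustration; only NB_WORDS = 10000 is pinned down by the parameter counts later in the post.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow.keras import models, layers, regularizers
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical

NB_WORDS = 10000       # most frequent words to keep in the dictionary
NB_START_EPOCHS = 20   # epochs for the initial training runs
BATCH_SIZE = 512       # batch size when fitting the models
```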
Some helper functions
We will use some helper functions throughout this post.
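For example, a plotting helper of this kind compares a metric on the train and validation sets over the epochs. This is an illustrative sketch, not necessarily the exact helpers used here:

```python
import matplotlib.pyplot as plt

def eval_metric(history, metric_name):
    """Plot a metric on the train and validation sets over all epochs."""
    metric = history.history[metric_name]
    val_metric = history.history['val_' + metric_name]
    epochs = range(1, len(metric) + 1)
    plt.plot(epochs, metric, label='Train ' + metric_name)
    plt.plot(epochs, val_metric, label='Validation ' + metric_name)
    plt.xlabel('Epoch')
    plt.legend()
    plt.show()
```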
We load the CSV with the tweets and perform a random shuffle. It's good practice to shuffle the data before splitting it into a train and test set. That way the sentiment classes are equally distributed over the train and test sets. We'll only keep the text column as input and the airline_sentiment column as the target.
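A sketch of the loading step; Tweets.csv is the file name the Kaggle data set usually ships with, so treat it as an assumption:

```python
df = pd.read_csv('Tweets.csv')
# Shuffle the rows so the sentiment classes are spread evenly
df = df.reindex(np.random.permutation(df.index))
# Keep only the input text and the target column
df = df[['text', 'airline_sentiment']]
```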
The next thing we'll do is remove stopwords. Stopwords have no value for predicting the sentiment. Furthermore, as we want to build a model that can be used for other airline companies as well, we remove the mentions.
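One way to implement this, using NLTK's English stopword list and a regular expression for the @-mentions (the exact cleaning rules here are a choice, not prescribed by the post):

```python
import re
from nltk.corpus import stopwords  # requires nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

def remove_stopwords(text):
    # Drop words that carry no sentiment signal
    return ' '.join(w for w in text.split() if w.lower() not in stop_words)

def remove_mentions(text):
    # Strip @airline mentions so the model is not tied to specific airlines
    return re.sub(r'@\w+', '', text)

df['text'] = df['text'].apply(remove_stopwords).apply(remove_mentions)
```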
The evaluation of the model performance must be done on a separate test set. That way, we can estimate how well the model generalizes. This is done with the train_test_split method of scikit-learn.
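For example (the 80/20 split and the random seed are assumptions):

```python
X_train, X_test, y_train, y_test = train_test_split(
    df['text'], df['airline_sentiment'], test_size=0.2, random_state=42)
```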
Converting words to numbers
To use the text as input for a model, we first need to convert the words into tokens, which simply means converting the words into integers that refer to an index in a dictionary. Here we'll only keep the most frequent words in the training set.
We clean up the text by applying filters and putting the words in lowercase. Words are separated by spaces.
After having created the dictionary, we can convert the text of a tweet to a vector with NB_WORDS values. With mode=binary, it contains an indicator for whether the word appeared in the tweet or not. This is done with the texts_to_matrix method of the Tokenizer.
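Putting these steps together with the Keras Tokenizer (the filter string shown is the Tokenizer's default):

```python
tk = Tokenizer(num_words=NB_WORDS,
               filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
               lower=True,
               split=' ')
tk.fit_on_texts(X_train)

# Each tweet becomes a vector of NB_WORDS 0/1 indicators
X_train_oh = tk.texts_to_matrix(X_train, mode='binary')
X_test_oh = tk.texts_to_matrix(X_test, mode='binary')
```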
Converting the target classes to numbers
We need to convert the target classes to numbers as well, which in turn are one-hot-encoded with the to_categorical method in Keras.
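A sketch using scikit-learn's LabelEncoder for the string-to-integer step:

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y_train_le = le.fit_transform(y_train)  # e.g. negative/neutral/positive -> 0/1/2
y_test_le = le.transform(y_test)
y_train_oh = to_categorical(y_train_le)
y_test_oh = to_categorical(y_test_le)
```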
Splitting off a validation set
Now that our data is ready, we split off a validation set. This validation set will be used to evaluate the model performance when we tune the parameters of the model.
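For example, holding out 10% of the training data (the fraction is an assumption):

```python
X_train_rest, X_valid, y_train_rest, y_valid = train_test_split(
    X_train_oh, y_train_oh, test_size=0.1, random_state=42)
```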
Creating a model that overfits
We start with a model that overfits. It has 2 densely connected layers of 64 elements each. The input_shape for the first layer is equal to the number of words we kept in the dictionary, for which we created one-hot-encoded features.
As we need to predict 3 different sentiment classes, the last layer has 3 elements. The softmax activation function makes sure the three probabilities sum up to 1.
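In code, the baseline architecture looks like this:

```python
base_model = models.Sequential()
base_model.add(layers.Dense(64, activation='relu', input_shape=(NB_WORDS,)))
base_model.add(layers.Dense(64, activation='relu'))
base_model.add(layers.Dense(3, activation='softmax'))
```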
The number of parameters to train is computed as (nb inputs x nb elements in hidden layer) + nb bias terms. The number of inputs for the first layer equals the number of words in our corpus. Subsequent layers take the number of outputs of the previous layer as their number of inputs. The number of parameters per layer is therefore:
- First layer: (10000 x 64) + 64 = 640064
- Second layer: (64 x 64) + 64 = 4160
- Last layer: (64 x 3) + 3 = 195
Because this project is a multi-class, single-label prediction, we use categorical_crossentropy as the loss function and softmax as the final activation function. We fit the model on the train data and validate on the validation set. We run a predetermined number of epochs and will see when the model starts to overfit.
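A sketch of the compile-and-fit step; the rmsprop optimizer is an assumption, any standard optimizer would do:

```python
base_model.compile(optimizer='rmsprop',
                   loss='categorical_crossentropy',
                   metrics=['accuracy'])
base_history = base_model.fit(X_train_rest, y_train_rest,
                              epochs=NB_START_EPOCHS,
                              batch_size=BATCH_SIZE,
                              validation_data=(X_valid, y_valid),
                              verbose=0)
eval_metric(base_history, 'loss')
```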
In the beginning, the validation loss goes down. But at epoch 3 this stops, and the validation loss starts increasing rapidly. This is when the model begins to overfit.
The training loss continues to go down and almost reaches zero at epoch 20. This is normal, as the model is trained to fit the train data as well as possible.
Now, we can try to do something about the overfitting. There are several options:
- Reduce the network's capacity by removing layers or reducing the number of elements in the hidden layers
- Apply regularization, which comes down to adding a cost to the loss function for large weights
- Use Dropout layers, which will randomly remove certain features by setting them to zero
Reducing the network’s capacity
Our first model has a large number of trainable parameters. The higher this number, the more easily the model can memorize the target class for each training sample. Obviously, this is not ideal for generalizing to new data.
By lowering the capacity of the network, you force it to learn only the patterns that matter, i.e. those that minimize the loss. On the other hand, reducing the network's capacity too much will lead to underfitting: the model will not be able to learn the relevant patterns in the train data.
We reduce the network's capacity by removing one hidden layer and lowering the number of elements in the remaining layer to 16.
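The reduced architecture then looks like this:

```python
reduced_model = models.Sequential()
reduced_model.add(layers.Dense(16, activation='relu', input_shape=(NB_WORDS,)))
reduced_model.add(layers.Dense(3, activation='softmax'))
# compiled and fit with the same settings as the baseline model above
```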
We can see that it takes more epochs before the reduced model starts overfitting. The validation loss also goes up more slowly than for our first model.
When we compare the validation loss with that of the baseline model, it is clear that the reduced model starts overfitting at a later epoch. Its validation loss also stays lower for much longer than the baseline model's.
Applying regularization
To address overfitting, we can apply weight regularization to the model. This adds a cost to the loss function of the network for large weights (or parameter values). As a result, you get a simpler model that is forced to learn only the relevant patterns in the train data.
There are two flavors: L1 regularization and L2 regularization.
- L1 regularization adds a cost proportional to the absolute value of the parameters. It results in some of the weights being exactly zero.
- L2 regularization adds a cost proportional to the squared value of the parameters. This results in smaller weights overall.
Let's try L2 regularization.
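A sketch of the regularized model; the regularization factor of 0.001 is an assumed value, typically tuned on the validation set:

```python
reg_model = models.Sequential()
reg_model.add(layers.Dense(64, activation='relu',
                           kernel_regularizer=regularizers.l2(0.001),
                           input_shape=(NB_WORDS,)))
reg_model.add(layers.Dense(64, activation='relu',
                           kernel_regularizer=regularizers.l2(0.001)))
reg_model.add(layers.Dense(3, activation='softmax'))
# compiled and fit with the same settings as the baseline model above
```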
For the regularized model, we notice that it starts overfitting in the same epoch as the baseline model. However, the loss increases much more slowly afterward.
Adding dropout layers
The last option we'll try is to add dropout layers. A dropout layer randomly sets a fraction of a layer's output features to zero during training.
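A sketch with a dropout layer after each hidden layer; the dropout rate of 0.5 is an assumed value:

```python
drop_model = models.Sequential()
drop_model.add(layers.Dense(64, activation='relu', input_shape=(NB_WORDS,)))
drop_model.add(layers.Dropout(0.5))
drop_model.add(layers.Dense(64, activation='relu'))
drop_model.add(layers.Dropout(0.5))
drop_model.add(layers.Dense(3, activation='softmax'))
# compiled and fit with the same settings as the baseline model above
```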
The model with the dropout layers starts overfitting later than the baseline model, and its loss also increases more slowly and remains much lower afterward.
Training on the full train data and evaluation on test data
At first sight, the reduced model seems to be the best model for generalization. But let’s check that on the test set.
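One way to run this comparison: rebuild each model from scratch, train it on the full train data up to the epoch where its validation loss started rising, and score it on the test set. The test_model helper and the epoch count below are illustrative, not the original code or results:

```python
def test_model(model, epoch_stop):
    # model should be a freshly built, untrained architecture
    model.compile(optimizer='rmsprop',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    model.fit(X_train_oh, y_train_oh,
              epochs=epoch_stop, batch_size=BATCH_SIZE, verbose=0)
    return model.evaluate(X_test_oh, y_test_oh, verbose=0)

# Rebuild the dropout architecture before the final run; epoch_stop is
# read off that model's validation-loss curve (the value 6 is a placeholder)
final_drop_model = models.Sequential([
    layers.Dense(64, activation='relu', input_shape=(NB_WORDS,)),
    layers.Dropout(0.5),
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(3, activation='softmax'),
])
drop_test_loss, drop_test_acc = test_model(final_drop_model, epoch_stop=6)
```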
As shown above, all three options help to reduce overfitting. We managed to increase the accuracy on the test data substantially. Among these three options, the model with the dropout layers performs best on the test data.