How to Handle Overfitting in Deep Learning Models

Overfitting occurs when your model fits the training data well but does not generalize to new, unseen data. In other words, the model has learned patterns specific to the training data that are irrelevant in other data.

We can identify overfitting by watching validation metrics like loss or accuracy. Usually, the validation metric stops improving after a certain number of epochs and begins to degrade afterward. The training metric continues to improve because the model seeks to find the best fit for the training data.
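As a rough illustration of this check, the sketch below takes per-epoch training and validation losses and reports the epoch with the lowest validation loss, plus whether the two curves diverge after it. The helper names and the loss values are invented for the example; they are not results from this post.

```python
def best_validation_epoch(val_losses):
    """Return the 1-based epoch with the lowest validation loss."""
    best_index = min(range(len(val_losses)), key=lambda i: val_losses[i])
    return best_index + 1


def is_overfitting_after(train_losses, val_losses, epoch):
    """True if, after `epoch`, the training loss ends lower while the
    validation loss ends higher than it was at `epoch`."""
    train_improves = train_losses[-1] < train_losses[epoch - 1]
    val_degrades = val_losses[-1] > val_losses[epoch - 1]
    return train_improves and val_degrades


# Hypothetical loss curves, invented for illustration only.
train_losses = [0.90, 0.60, 0.40, 0.25, 0.15, 0.08]
val_losses = [0.85, 0.65, 0.58, 0.62, 0.70, 0.81]

best_epoch = best_validation_epoch(val_losses)
print(best_epoch)
print(is_overfitting_after(train_losses, val_losses, best_epoch))
```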

There are several ways in which we can reduce overfitting in deep learning models. The best option is to get more training data. Unfortunately, in real-world situations, you often do not have this possibility due to time, budget, or technical constraints.

Another way to reduce overfitting is to lower the capacity of the model to memorize the training data. As such, the model will need to focus on the relevant patterns in the training data, which results in better generalization. In this post, we’ll discuss three options to achieve this.

Set up the project

We start by importing the required packages and configuring some parameters. We’ll use Keras to fit the deep learning models. The training data is the Twitter US Airline Sentiment data set from Kaggle.

# Basic packages
import pandas as pd 
import numpy as np
import re
import collections
import matplotlib.pyplot as plt
from pathlib import Path
# Packages for data preparation
from sklearn.model_selection import train_test_split
from nltk.corpus import stopwords
from keras.preprocessing.text import Tokenizer
from keras.utils import to_categorical
from sklearn.preprocessing import LabelEncoder
# Packages for modeling
from keras import models
from keras import layers
from keras import regularizers
NB_WORDS = 10000  # Parameter indicating the number of words we'll put in the dictionary
NB_START_EPOCHS = 20  # Number of epochs we usually start to train with
BATCH_SIZE = 512  # Size of the batches used in the mini-batch gradient descent
MAX_LEN = 20  # Maximum number of words in a sequence
root = Path('../')
input_path = root / 'input/' 
output_path = root / 'output/'
source_path = root / 'source/'

Some helper functions

We will use some helper functions throughout this post.

def deep_model(model, X_train, y_train, X_valid, y_valid):
    '''
    Function to train a multi-class model. The number of epochs and
    batch_size are set by the constants NB_START_EPOCHS and BATCH_SIZE.

    Parameters:
        model : model with the chosen architecture
        X_train : training features
        y_train : training target
        X_valid : validation features
        y_valid : validation target
    Output:
        model training history
    '''
    # The optimizer is an assumption: the original listing does not show it
    model.compile(optimizer='rmsprop',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    history = model.fit(X_train,
                        y_train,
                        epochs=NB_START_EPOCHS,
                        batch_size=BATCH_SIZE,
                        validation_data=(X_valid, y_valid),
                        verbose=0)
    return history

def eval_metric(model, history, metric_name):
    '''
    Function to evaluate a trained model on a chosen metric.
    Training and validation metric are plotted in a
    line chart for each epoch.

    Parameters:
        model : trained model, whose name appears in the chart title
        history : model training history
        metric_name : loss or accuracy
    Output:
        line chart with epochs on the x-axis and the metric on the y-axis
    '''
    metric = history.history[metric_name]
    val_metric = history.history['val_' + metric_name]
    e = range(1, NB_START_EPOCHS + 1)
    plt.plot(e, metric, 'bo', label='Train ' + metric_name)
    plt.plot(e, val_metric, 'b', label='Validation ' + metric_name)
    plt.xlabel('Epoch number')
    # The end of the title and the closing calls are assumptions: the original listing is cut off here
    plt.title('Comparing training and validation ' + metric_name + ' for ' + model.name)
    plt.legend()
    plt.show()

def test_model(model, X_train, y_train, X_test, y_test, epoch_stop):
    '''
    Function to test the model on new data after training it
    on the full training data with the optimal number of epochs.

    Parameters:
        model : trained model
        X_train : training features
        y_train : training target
        X_test : test features
        y_test : test target
        epoch_stop : optimal number of epochs
    Output:
        test loss and test accuracy
    '''
    model.fit(X_train,
              y_train,
              epochs=epoch_stop,
              batch_size=BATCH_SIZE,
              verbose=0)
    results = model.evaluate(X_test, y_test)
    print('Test accuracy: {0:.2f}%'.format(results[1]*100))
    return results

def remove_stopwords(input_text):
    '''
    Function to remove English stopwords from a single text.
    Use Series.apply to run it on every element of a Pandas Series.

    Parameters:
        input_text : text to clean
    Output:
        cleaned text
    '''
    stopwords_list = stopwords.words('english')
    # Some words which might indicate a certain sentiment are kept via a whitelist
    whitelist = ["n't", "not", "no"]
    words = input_text.split()
    clean_words = [word for word in words if (word not in stopwords_list or word in whitelist) and len(word) > 1]
    return " ".join(clean_words)

def remove_mentions(input_text):
    '''
    Function to remove mentions, preceded by @, from a single text.

    Parameters:
        input_text : text to clean
    Output:
        cleaned text
    '''
    return re.sub(r'@\w+', '', input_text)

def compare_models_by_metric(model_1, model_2, model_hist_1, model_hist_2, metric):
    '''
    Function to compare a metric between two models.

    Parameters:
        model_1, model_2 : models to compare (their names label the lines)
        model_hist_1 : training history of model 1
        model_hist_2 : training history of model 2
        metric : metric to compare: loss, acc, val_loss or val_acc
    Output:
        plot of metrics of both models
    '''
    metric_model_1 = model_hist_1.history[metric]
    metric_model_2 = model_hist_2.history[metric]
    e = range(1, NB_START_EPOCHS + 1)
    metrics_dict = {
        'acc' : 'Training Accuracy',
        'loss' : 'Training Loss',
        'val_acc' : 'Validation accuracy',
        'val_loss' : 'Validation loss'
    }
    metric_label = metrics_dict[metric]
    # The label arguments and closing calls are assumptions: the original listing is cut off here
    plt.plot(e, metric_model_1, 'bo', label=model_1.name)
    plt.plot(e, metric_model_2, 'b', label=model_2.name)
    plt.xlabel('Epoch number')
    plt.title('Comparing ' + metric_label + ' between models')
    plt.legend()
    plt.show()

def optimal_epoch(model_hist):
    '''
    Function to return the epoch number where the validation loss is
    at its minimum.

    Parameters:
        model_hist : training history of model
    Output:
        epoch number with minimum validation loss
    '''
    # argmin is 0-based, epochs are counted from 1
    min_epoch = int(np.argmin(model_hist.history['val_loss'])) + 1
    print("Minimum validation loss reached in epoch {}".format(min_epoch))
    return min_epoch

Data preparation

Data cleaning

We load the CSV with the tweets and perform a random shuffle. It’s good practice to shuffle the data before splitting it into a train and test set. That way the sentiment classes are roughly equally distributed over the train and test sets. We’ll only keep the text column as input and the airline_sentiment column as the target.

The next thing we’ll do is remove stopwords. Stopwords do not have any value for predicting the sentiment. Furthermore, as we want to build a model that can be used for other airline companies as well, we remove the mentions.

df = pd.read_csv(input_path / 'Tweets.csv')
df = df.reindex(np.random.permutation(df.index))  
df = df[['text', 'airline_sentiment']]
df.text = df.text.apply(remove_stopwords).apply(remove_mentions)
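To make the effect of the two cleaning steps concrete, here is a small standalone sketch. It uses a tiny hard-coded stopword list (an assumption, so the example runs without downloading the NLTK corpus), compares words case-insensitively, and strips mentions before stopwords. The function names and the sample tweet are invented for the example.

```python
import re

# Tiny stand-in for stopwords.words('english'); an assumption for this sketch.
STOPWORDS = {"the", "is", "a", "to", "and", "my", "was", "i"}
# Negations carry sentiment, so they are always kept.
WHITELIST = {"n't", "not", "no"}


def strip_mentions(text):
    """Remove @mentions such as @SomeAirline."""
    return re.sub(r'@\w+', '', text)


def strip_stopwords(text):
    """Drop stopwords and single-character tokens, keeping whitelisted words."""
    words = text.split()
    kept = [w for w in words
            if (w.lower() not in STOPWORDS or w.lower() in WHITELIST) and len(w) > 1]
    return " ".join(kept)


tweet = "@SomeAirline my flight was not on time and the staff is rude"
cleaned = strip_stopwords(strip_mentions(tweet))
print(cleaned)
```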

Train-Test split

The evaluation of the model performance needs to be done on a separate test set. As such, we can estimate how well the model generalizes. This is done with the train_test_split function of scikit-learn.

X_train, X_test, y_train, y_test = train_test_split(df.text, df.airline_sentiment, test_size=0.1, random_state=37)
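One way to check that the split kept the class balance roughly intact is to compare the class shares in both parts. The sketch below does this in plain Python on made-up labels; `class_shares` is a hypothetical helper, not part of scikit-learn. If the shares differ too much, `train_test_split` also accepts a `stratify` argument that preserves the class proportions.

```python
from collections import Counter


def class_shares(labels):
    """Return a dict mapping each class to its share of the labels."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


# Made-up labels standing in for y_train and y_test.
train_labels = ["negative"] * 6 + ["neutral"] * 2 + ["positive"] * 2
test_labels = ["negative"] * 3 + ["neutral"] * 1 + ["positive"] * 1

train_shares = class_shares(train_labels)
test_shares = class_shares(test_labels)
for label in sorted(train_shares):
    print(label, round(train_shares[label], 2), round(test_shares.get(label, 0.0), 2))
```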

Converting words to numbers

To use the text as input for a model, we first need to convert the words into tokens, which simply means converting the words into integers that refer to an index in a dictionary. Here we’ll only keep the most frequent words in the training set.

We clean up the text by applying filters and converting the words to lowercase. Words are separated by spaces.

tk = Tokenizer(num_words=NB_WORDS,
               filters='!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n',
               lower=True,
               split=' ')
# Build the dictionary from the training texts only
tk.fit_on_texts(X_train)

After having created the dictionary we can convert the text of a tweet to a vector with NB_WORDS values. With mode=binary, it contains an indicator whether the word appeared in the tweet or not. This is done with the texts_to_matrix method of the Tokenizer.

X_train_oh = tk.texts_to_matrix(X_train, mode='binary')
X_test_oh = tk.texts_to_matrix(X_test, mode='binary')
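The following plain-Python sketch mimics, in simplified form, what the tokenizer and texts_to_matrix with mode='binary' do: build an index of the most frequent words, then mark which of those words occur in a text. It is not the Keras implementation. The function names and sample texts are assumptions for illustration, and unlike Keras the sketch starts its indices at 0 and applies no punctuation filters.

```python
from collections import Counter


def build_word_index(texts, num_words):
    """Map the `num_words` most frequent lowercase words to column indices."""
    counts = Counter(word for text in texts for word in text.lower().split())
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return {word: index for index, (word, _) in enumerate(ranked[:num_words])}


def to_binary_vector(text, word_index):
    """Return a 0/1 list with a 1 for every indexed word present in `text`."""
    vector = [0] * len(word_index)
    for word in text.lower().split():
        if word in word_index:
            vector[word_index[word]] = 1
    return vector


train_texts = ["late flight again", "great flight", "late late bag"]
word_index = build_word_index(train_texts, num_words=3)
print(word_index)
print(to_binary_vector("flight was late", word_index))
```

Words outside the index, such as "was" in the last call, are simply ignored, which is also why rare words beyond the NB_WORDS limit do not show up in the feature matrix.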

Converting the target classes to numbers

We need to convert the target classes to numbers as well, which in turn are one-hot-encoded with the to_categorical function in Keras.

le = LabelEncoder()
y_train_le = le.fit_transform(y_train)
y_test_le = le.transform(y_test)
y_train_oh = to_categorical(y_train_le)
y_test_oh = to_categorical(y_test_le)
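In plain Python, the two steps amount to the following. This is only a sketch of the idea, not the scikit-learn or Keras code; `encode_labels` and `one_hot` are hypothetical helper names. Like LabelEncoder, the sketch assigns integers in sorted class order.

```python
def encode_labels(labels):
    """Map each class to an integer, assigned in sorted class order."""
    classes = sorted(set(labels))
    lookup = {label: index for index, label in enumerate(classes)}
    return [lookup[label] for label in labels], classes


def one_hot(indices, num_classes):
    """Turn integer class indices into one-hot rows."""
    return [[1.0 if i == index else 0.0 for i in range(num_classes)]
            for index in indices]


labels = ["neutral", "negative", "positive", "negative"]
encoded, classes = encode_labels(labels)
print(classes)
print(encoded)
print(one_hot(encoded, len(classes)))
```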

Splitting off a validation set

Now that our data is ready, we split off a validation set. This validation set will be used to evaluate the model performance when we tune the parameters of the model.

X_train_rest, X_valid, y_train_rest, y_valid = train_test_split(X_train_oh, y_train_oh, test_size=0.1, random_state=37)

Deep learning

Creating a model that overfits

We start with a model that overfits. It has 2 densely connected layers of 64 elements. The input_shape for the first layer is equal to the number of words we kept in the dictionary and for which we created one-hot-encoded features.

As we need to predict 3 different sentiment classes, the last layer has 3 elements. The softmax activation function makes sure the three probabilities sum up to 1.

The number of parameters to train is computed as (nb inputs x nb elements in hidden layer) + nb bias terms. The number of inputs for the first layer equals the number of words we kept in the dictionary (NB_WORDS). The subsequent layers have the number of outputs of the previous layer as inputs. So the number of parameters per layer is:

  • First layer : (10000 x 64) + 64 = 640064
  • Second layer : (64 x 64) + 64 = 4160
  • Last layer : (64 x 3) + 3 = 195

# Name the model so the charts can show it in their titles
base_model = models.Sequential(name='baseline_model')
base_model.add(layers.Dense(64, activation='relu', input_shape=(NB_WORDS,)))
base_model.add(layers.Dense(64, activation='relu'))
base_model.add(layers.Dense(3, activation='softmax'))
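
The per-layer parameter counts listed earlier can be verified with a few lines of Python. `dense_params` is a hypothetical helper written for this check; it is not part of Keras.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit for a fully connected layer."""
    return n_inputs * n_units + n_units


layer_sizes = [10000, 64, 64, 3]  # inputs, two hidden layers, output
per_layer = [dense_params(n_in, n_out)
             for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
print(per_layer)
print(sum(per_layer))
```

Almost all of the parameters sit in the first layer, because its size is driven by the vocabulary size.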


Because this project is a multi-class, single-label prediction, we use categorical_crossentropy as the loss function and softmax as the final activation function. We fit the model on the train data and validate on the validation set. We run for a predetermined number of epochs and will see when the model starts to overfit.
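
For readers who want to see what these two functions compute, here is a minimal sketch for a single sample in plain Python. It follows the textbook definitions rather than the Keras source, and the input scores are made up.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # shift for numerical stability
    total = sum(exps)
    return [e / total for e in exps]


def categorical_crossentropy(one_hot_target, probabilities):
    """Negative log-probability assigned to the true class."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_target, probabilities))


probs = softmax([2.0, 1.0, 0.1])  # made-up scores for three classes
loss = categorical_crossentropy([1, 0, 0], probs)
print([round(p, 3) for p in probs])
print(round(loss, 3))
```

The loss shrinks toward zero as the probability of the true class approaches 1, which is what drives the training loss down epoch after epoch.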

base_history = deep_model(base_model, X_train_rest, y_train_rest, X_valid, y_valid)
base_min = optimal_epoch(base_history)
eval_metric(base_model, base_history, 'loss')


In the beginning, the validation loss goes down. But at epoch 3 this stops and the validation loss starts increasing rapidly. This is when the model begins to overfit.

The training loss continues to go down and almost reaches zero at epoch 20. This is normal because the model is trained to fit the train data as well as possible.

Handling overfitting

Now, we can try to do something about the overfitting. There are different options to do that.

  • Reduce the network’s capacity by removing layers or reducing the number of elements in the hidden layers
  • Apply regularization, which comes down to adding a cost to the loss function for large weights
  • Use Dropout layers, which will randomly remove certain features by setting them to zero

Reducing the network’s capacity

Our first model has a large number of trainable parameters. The higher this number, the more easily the model can memorize the target class for each training sample. Obviously, this is not ideal for generalizing to new data.

By lowering the capacity of the network, you force it to learn the patterns that matter or that minimize the loss. On the other hand, reducing the network’s capacity too much will lead to underfitting. The model will not be able to learn the relevant patterns in the train data.

We reduce the network’s capacity by removing one hidden layer and lowering the number of elements in the remaining layer to 16.

reduced_model = models.Sequential(name='reduced_model')
reduced_model.add(layers.Dense(16, activation='relu', input_shape=(NB_WORDS,)))
reduced_model.add(layers.Dense(3, activation='softmax'))
reduced_history = deep_model(reduced_model, X_train_rest, y_train_rest, X_valid, y_valid)
reduced_min = optimal_epoch(reduced_history)
eval_metric(reduced_model, reduced_history, 'loss')

We can see that it takes more epochs before the reduced model starts overfitting. The validation loss also goes up more slowly than for our first model.

compare_models_by_metric(base_model, reduced_model, base_history, reduced_history, 'val_loss')


When we compare the validation loss of both models, it’s clear that the reduced model starts overfitting at a later epoch. Its validation loss stays lower for much longer than that of the baseline model.

Applying regularization

To address overfitting, we can apply weight regularization to the model. This will add a cost to the loss function of the network for large weights (or parameter values). As a result, you get a simpler model that will be forced to learn only the relevant patterns in the train data.

There are L1 regularization and L2 regularization.

  • L1 regularization will add a cost proportional to the absolute value of the parameters. It will result in some of the weights being equal to zero.
  • L2 regularization will add a cost proportional to the squared value of the parameters. This results in smaller weights.
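
The two penalties can be written out in a few lines of plain Python. This is a sketch of the formulas only: the weights, the unregularized loss, and the helper names are made up, and the factor 0.001 is simply reused from the Keras example in this section.

```python
def l1_penalty(weights, factor):
    """factor * sum of absolute weight values."""
    return factor * sum(abs(w) for w in weights)


def l2_penalty(weights, factor):
    """factor * sum of squared weight values."""
    return factor * sum(w * w for w in weights)


weights = [0.5, -2.0, 0.0, 3.0]  # made-up weights
data_loss = 0.40                 # made-up loss before regularization

print(data_loss + l1_penalty(weights, 0.001))
print(data_loss + l2_penalty(weights, 0.001))
```

Because the L2 term squares each weight, the largest weights dominate the penalty, which is why it mainly pushes large weights down.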

Let’s try with L2 regularization.

reg_model = models.Sequential(name='l2_regularization_model')
reg_model.add(layers.Dense(64, kernel_regularizer=regularizers.l2(0.001), activation='relu', input_shape=(NB_WORDS,)))
reg_model.add(layers.Dense(64, kernel_regularizer=regularizers.l2(0.001), activation='relu'))
reg_model.add(layers.Dense(3, activation='softmax'))
reg_history = deep_model(reg_model, X_train_rest, y_train_rest, X_valid, y_valid)
reg_min = optimal_epoch(reg_history)

For the regularized model we notice that it starts overfitting in the same epoch as the baseline model. However, the loss increases much more slowly afterward.

eval_metric(reg_model, reg_history, 'loss')

compare_models_by_metric(base_model, reg_model, base_history, reg_history, 'val_loss')

Adding dropout layers

The last option we’ll try is to add dropout layers. A dropout layer will randomly set output features of a layer to zero.
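
To make the mechanism concrete, here is a minimal plain-Python sketch of dropout applied to a list of activations. It mirrors the documented behaviour of the Keras Dropout layer, which scales the kept values by 1 / (1 - rate) during training and leaves the input untouched at inference time, but it is not the Keras code. The function signature, the seed, and the activation values are assumptions for illustration.

```python
import random


def dropout(values, rate, rng, training=True):
    """Inverted dropout: zero each value with probability `rate` during
    training and scale the survivors by 1 / (1 - rate)."""
    if not training or rate == 0.0:
        return list(values)
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else v * keep_scale for v in values]


rng = random.Random(0)  # fixed seed so the run is repeatable
activations = [0.2, 1.5, 0.7, 0.9, 1.1, 0.3]
dropped = dropout(activations, rate=0.5, rng=rng)
print(dropped)
print(dropout(activations, rate=0.5, rng=rng, training=False))
```

Since a different random subset is zeroed on every batch, the network cannot rely on any single feature, which limits memorization of the training samples.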

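drop_model = models.Sequential(name='dropout_layers_model')
drop_model.add(layers.Dense(64, activation='relu', input_shape=(NB_WORDS,)))
# The dropout rate of 0.5 is an assumption: the original listing does not show the Dropout layers
drop_model.add(layers.Dropout(0.5))
drop_model.add(layers.Dense(64, activation='relu'))
drop_model.add(layers.Dropout(0.5))
drop_model.add(layers.Dense(3, activation='softmax'))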
drop_history = deep_model(drop_model, X_train_rest, y_train_rest, X_valid, y_valid)
drop_min = optimal_epoch(drop_history)
eval_metric(drop_model, drop_history, 'loss')

The model with dropout layers starts overfitting later than the baseline model. The loss also increases more slowly than for the baseline model.

compare_models_by_metric(base_model, drop_model, base_history, drop_history, 'val_loss')


The model with the dropout layers starts overfitting later. Compared to the baseline model the loss also remains much lower.

Training on the full train data and evaluation on test data

At first sight, the reduced model seems to be the best model for generalization. But let’s check that on the test set.

base_results = test_model(base_model, X_train_oh, y_train_oh, X_test_oh, y_test_oh, base_min)
reduced_results = test_model(reduced_model, X_train_oh, y_train_oh, X_test_oh, y_test_oh, reduced_min)
reg_results = test_model(reg_model, X_train_oh, y_train_oh, X_test_oh, y_test_oh, reg_min)
drop_results = test_model(drop_model, X_train_oh, y_train_oh, X_test_oh, y_test_oh, drop_min)


As shown above, all three options help to reduce overfitting. We manage to increase the accuracy on the test data substantially. Among these three options, the model with the dropout layers performs the best on the test data.

About the author

Wikitechy Editor

Wikitechy Founder, Author, International Speaker, and Job Consultant. In my role as the CEO of Wikitechy, I help businesses build their next-generation digital platforms and help with their product innovation and growth strategy. I'm a frequent speaker at tech conferences and events.