Tuning a Deep Learning Model

Elan Ding

Modified: July 14, 2018 $\newcommand{\bs}{\boldsymbol}$ $\newcommand{\argmin}[1]{\underset{\bs{#1}}{\text{arg min}}\,}$ $\newcommand{\argmax}[1]{\underset{\bs{#1}}{\text{arg max}}\,}$ $\newcommand{\tr}{^{\top}}$ $\newcommand{\norm}[1]{\left|\left|\,#1\,\right|\right|}$ $\newcommand{\given}{\,|\,}$

Training a deep neural network is a highly empirical process. The first successful deep model I implemented had 5 layers, a learning rate of 0.001, and ran 12,000 iterations of gradient descent. The result was slightly better than my previous model, but not by much. You may ask: where did I come up with those numbers? The answer is simply trial and error, guided by some intuition. When the hyperparameters are chosen appropriately, the training process looks quite simple, but in reality much of my time was spent tuning the model. It is also fun to do other things, such as watching Netflix or drinking green tea, while periodically checking on my neural network's performance. In this post I will show a few methods to improve a neural network's performance.

Early Stopping

Sometimes, if you run gradient descent for too long, the model gains too much flexibility, leading to high variance, and test set performance suffers. By stopping early, we can keep overfitting in check. To implement this, we simply modify our training loop.

In [1]:
def model_early_stopping(X, Y, valid_x, valid_y, layers_dims, learning_rate=0.0075, num_iterations=10000, print_cost=False, print_size=100):

    costs = []
    accuracy_prev = 0

    parameters = initialize_parameters_he(layers_dims)

    for i in range(num_iterations):

        # one step of forward propagation, cost computation, and back propagation
        AL, caches = model_forward(X, parameters)
        cost = compute_cost(AL, Y)
        grads = model_backward(AL, Y, caches)

        # keep a copy of the current parameters so we can roll back if validation accuracy drops
        cache = {key: value.copy() for key, value in parameters.items()}

        parameters = update_parameters(parameters, grads, learning_rate=learning_rate)

        if print_cost and i % print_size == 0:
            print("Cost after iteration %i: %f" % (i, cost))

        # every 100 iterations, check the accuracy on the validation set
        if i % 100 == 0:
            accuracy = predict(valid_x, valid_y, parameters)
            costs.append(cost)

            if accuracy < accuracy_prev:
                # validation accuracy went down: restore the previous parameters and stop
                parameters = cache
                print("Early stopping at iteration {} to prevent overfitting".format(i))
                break
            else:
                accuracy_prev = accuracy

    return parameters

In this code we continuously track the model's accuracy on the validation set. Initially, the accuracy should improve; once it starts to drop, the model is beginning to overfit, and we stop training. This was the first technique I used, and my five-layer model stopped at iteration 12,000. It gave a training set accuracy of 98% and a test set accuracy of about 75%, which is slightly better than my older model. However, early stopping did not really solve the overfitting problem, and it may not be the best technique here: by stopping early, the model also gives up some of its ability to keep learning the training set.

L2 Regularization

Another popular technique for reducing overfitting is L2 regularization. The idea is simple: we add to the cost function an extra term that penalizes large weights. That is,

$$ J_{\text{regularized}} = -\frac{1}{m}\left[\log(\bs{A}^{[L]})\bs{Y}\tr + \log(1-\bs{A}^{[L]})(1-\bs{Y})\tr \right] + \frac{\lambda}{2m} \sum_{l=1}^{L}\norm{\bs{W}^{[l]}}^2_F $$

To implement this, note that forward propagation is unaffected; we only need to modify the cost function and back propagation. Let's start with the cost function.

In [2]:
def compute_cost_with_regularization(AL, Y, parameters, lambd):

    m = Y.shape[1]
    cross_entropy_cost = compute_cost(AL, Y)
    L2_regularization_cost = 0

    # sum of the squared Frobenius norms of all weight matrices
    for l in range(len(parameters) // 2):
        L2_regularization_cost += np.sum(np.square(parameters["W" + str(l+1)]))

    L2_regularization_cost = L2_regularization_cost * (1. / m) * (lambd / 2)

    cost = cross_entropy_cost + L2_regularization_cost

    return cost

Here the variable lambd refers to $\lambda$, the penalty hyperparameter. The higher $\lambda$ is, the more regularization penalty we are placing on the model.
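As a quick numerical illustration (the toy weight matrix and the values of m and lambd below are made up, not the model's actual parameters), the penalty equals $\frac{\lambda}{2m}\sum_l \norm{\bs{W}^{[l]}}^2_F$ and scales linearly with $\lambda$:

import numpy as np

toy_parameters = {"W1": np.array([[1., 2.], [3., 4.]]), "b1": np.zeros((2, 1))}
m = 10          # pretend batch size
lambd = 0.7     # regularization strength

penalty = (lambd / (2 * m)) * np.sum(np.square(toy_parameters["W1"]))
print(penalty)                                                           # (0.7 / 20) * 30 = 1.05

# doubling lambda doubles the penalty added to the cross-entropy cost
print((2 * lambd / (2 * m)) * np.sum(np.square(toy_parameters["W1"])))   # 2.1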

A fun fact from linear algebra is that we can rewrite the squared Frobenius norm in terms of the trace, so its derivative is easily found:

$$ \frac{d}{d\bs{W}} \frac{\lambda}{2m} \norm{\bs{W}}^2_F = \frac{d}{d\bs{W}} \frac{\lambda}{2m} \text{Tr}(\bs{W}\bs{W}\tr) = \frac{\lambda}{m} \bs{W} $$
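To double-check this derivative, here is a quick numerical sketch (the random $\bs{W}$ and the values of lambd and m below are hypothetical, used only for the check): a centered finite difference of the penalty with respect to one entry of $\bs{W}$ should match the corresponding entry of $\frac{\lambda}{m}\bs{W}$.

import numpy as np

np.random.seed(0)
W = np.random.randn(3, 4)     # hypothetical weight matrix
lambd, m = 0.7, 100           # hypothetical hyperparameter and batch size
eps = 1e-6

def l2_penalty(W):
    return (lambd / (2 * m)) * np.sum(np.square(W))

# centered finite difference with respect to W[0, 0]
W_plus, W_minus = W.copy(), W.copy()
W_plus[0, 0] += eps
W_minus[0, 0] -= eps
numerical = (l2_penalty(W_plus) - l2_penalty(W_minus)) / (2 * eps)

analytic = (lambd / m) * W[0, 0]
print(numerical, analytic)    # the two values agree to many decimal places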

Hence, we can update the back propagation by making the following changes.

In [3]:
def activation_backward_with_regularization(dA, cache, activation, lambd):

    linear_cache, activation_cache = cache

    if activation == "relu":

        dZ = relu_backward(dA, activation_cache)

    elif activation == "sigmoid":

        dZ = sigmoid_backward(dA, activation_cache)

    A_prev, W, b = linear_cache
    m = A_prev.shape[1]

    # the extra lambd * W term comes from differentiating the L2 penalty
    dW = (1. / m) * (np.dot(dZ, A_prev.T) + lambd * W)
    db = (1. / m) * np.sum(dZ, axis=1, keepdims=True)
    dA_prev = np.dot(W.T, dZ)

    assert (dA_prev.shape == A_prev.shape)
    assert (dW.shape == W.shape)
    assert (db.shape == b.shape)

    return dA_prev, dW, db
In [4]:
def model_backward_with_regularization(AL, Y, caches, lambd):

    grads = {}
    L = len(caches)
    m = AL.shape[1]
    Y = Y.reshape(AL.shape)

    # nan_divide is a safe element-wise division helper defined near the end of this post
    dAL = - (nan_divide(Y, AL) - nan_divide(1 - Y, 1 - AL))

    current_cache = caches[-1]
    grads["dA" + str(L)], grads["dW" + str(L)], grads["db" + str(L)] = activation_backward_with_regularization(dAL, current_cache, activation="sigmoid", lambd = lambd)

    for l in reversed(range(L-1)):

        current_cache = caches[l]

        dA_prev_temp, dW_temp, db_temp = activation_backward_with_regularization(grads["dA" + str(l + 2)], current_cache, activation="relu", lambd = lambd)
        grads["dA" + str(l + 1)] = dA_prev_temp
        grads["dW" + str(l + 1)] = dW_temp
        grads["db" + str(l + 1)] = db_temp

    return grads

Dropout

Dropout is a popular regularization method that is very effective for image classification. The basic idea is that at each activation step, we randomly "knock out" some neurons by setting them to 0. Mathematically, we multiply (elementwise) each hidden activation $\bs{A}^{[l]}$ by a random mask $\bs{D}^{[l]}$ of the same dimension, where each entry of $\bs{D}^{[l]}$ is 1 with probability keep_prob and 0 with probability 1-keep_prob. We then divide by keep_prob (so-called inverted dropout) so that the expected value of the activations is unchanged. The following are the modifications.

In [5]:
def activation_forward_with_dropout(A_prev, W, b, activation, keep_prob):

    Z = np.dot(W, A_prev) + b

    assert(Z.shape == (W.shape[0], A_prev.shape[1]))

    linear_cache = (A_prev, W, b)

    if activation == "sigmoid":

        A, activation_cache = sigmoid(Z)

        cache = (linear_cache, activation_cache)

    elif activation == "relu":

        A, activation_cache = relu(Z)

        # create the dropout mask and apply inverted dropout
        D = np.random.rand(A.shape[0], A.shape[1])
        D = D < keep_prob            # entries are 1 with probability keep_prob, else 0
        A = np.multiply(A, D)        # knock out the dropped neurons
        A /= keep_prob               # rescale so the expected activation is unchanged

        cache = (linear_cache, activation_cache, D)

    assert (A.shape == (W.shape[0], A_prev.shape[1]))

    return A, cache
In [6]:
def model_forward_with_dropout(X, parameters, keep_prob=0.5):

    caches = []
    A = X
    L = len(parameters) // 2

    for l in range(1, L):
        A_prev = A

        A, cache = activation_forward_with_dropout(A_prev,
                                      parameters['W' + str(l)],
                                      parameters['b' + str(l)],
                                      activation='relu',
                                      keep_prob = keep_prob)
        caches.append(cache)

    AL, cache = activation_forward_with_dropout(A,
                                   parameters['W' + str(L)],
                                   parameters['b' + str(L)],
                                   activation='sigmoid',
                                   keep_prob = keep_prob)
    caches.append(cache)

    assert(AL.shape == (1, X.shape[1]))

    return AL, caches
In [7]:
def activation_backward_with_dropout(dA, cache, activation, keep_prob):

    linear_cache, activation_cache, D = cache

    A_prev, W, b = linear_cache

    if activation == "relu":

        # apply the same mask D that was used in the forward pass, and rescale
        dA = np.multiply(dA, D)
        dA /= keep_prob

        dZ = relu_backward(dA, activation_cache)

    elif activation == "sigmoid":

        dZ = sigmoid_backward(dA, activation_cache)

    m = A_prev.shape[1]

    dW = (1. / m) * np.dot(dZ, A_prev.T)
    db = (1. / m) * np.sum(dZ, axis=1, keepdims=True)
    dA_prev = np.dot(W.T, dZ)

    assert (dA_prev.shape == A_prev.shape)
    assert (dW.shape == W.shape)
    assert (db.shape == b.shape)

    return dA_prev, dW, db
In [8]:
def model_backward_with_dropout(AL, Y, caches, keep_prob):

    grads = {}
    L = len(caches)
    m = AL.shape[1]
    Y = Y.reshape(AL.shape)

    dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))

    # the output (sigmoid) layer has no dropout, so we use the plain backward step here
    current_cache = caches[-1]
    grads["dA" + str(L)], grads["dW" + str(L)], grads["db" + str(L)] = activation_backward(dAL, current_cache, activation="sigmoid")

    for l in reversed(range(L-1)):

        current_cache = caches[l]

        dA_prev_temp, dW_temp, db_temp = activation_backward_with_dropout(grads["dA" + str(l + 2)], current_cache, activation="relu", keep_prob = keep_prob)
        grads["dA" + str(l + 1)] = dA_prev_temp
        grads["dW" + str(l + 1)] = dW_temp
        grads["db" + str(l + 1)] = db_temp

    return grads
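Before wiring these pieces into the full model, here is a quick check of the inverted-dropout scaling, using a hypothetical activation matrix A that is not part of the model code: averaged over many random masks, multiplying by D and dividing by keep_prob leaves the activations essentially unchanged.

import numpy as np

np.random.seed(1)
A = np.random.rand(5, 4)     # hypothetical activations
keep_prob = 0.8

average = np.zeros_like(A)
num_trials = 10000
for _ in range(num_trials):
    D = np.random.rand(*A.shape) < keep_prob     # same mask construction as above
    average += np.multiply(A, D) / keep_prob
average /= num_trials

print(np.max(np.abs(average - A)))    # small, so the expected activation is preserved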

At this point everything looked good, until I used this algorithm to train a 3-layer neural network. After about 3,000 iterations, I got a divide-by-zero error. It turns out that with dropout, the output activations can hit exactly 0 or 1, so np.divide in back propagation divides by zero; likewise, the log-likelihood in the original compute_cost function becomes undefined at 0. Here is an easy fix. First, we redefine the cost function as follows.

In [9]:
def compute_cost(AL, Y):

    m = Y.shape[1]

    # entries where AL is exactly 0 or 1 can produce 0 * inf = NaN; np.nansum treats them as 0
    logprobs = np.multiply(-np.log(AL), Y) + np.multiply(-np.log(1-AL), 1 - Y)
    cost = 1./m * np.nansum(logprobs)

    assert(cost.shape == ())

    return cost

The cool thing about np.nansum is that it treats all the NaN values as 0 when summing. Furthermore, we define a new division function.

In [10]:
def nan_divide(a, b):

    # element-wise division that maps x/0 and 0/0 to 0 instead of inf or NaN
    with np.errstate(divide='ignore', invalid='ignore'):
        c = np.true_divide(a, b)
        c[c == np.inf] = 0     # x/0 with x > 0 gives inf; set it to 0
        c = np.nan_to_num(c)   # 0/0 gives NaN, which becomes 0
    return c

Then we replace np.divide with nan_divide in back propagation. Everything should be working now!
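Here is a quick demonstration of both helpers on toy arrays (hypothetical values, just to show the behavior):

import numpy as np

print(np.nansum(np.array([1.0, np.nan, 2.0])))    # 3.0 -- the NaN is treated as 0

a = np.array([[1.0, 0.0, 2.0]])
b = np.array([[0.0, 0.0, 4.0]])
print(nan_divide(a, b))                           # [[0.  0.  0.5]] -- no inf or NaN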

Wrapping Up

Finally, we put everything into a model:

In [11]:
def model(X, Y, layers_dims, learning_rate=0.0075, num_iterations=10000, print_cost=False, print_size=100, lambd=0, keep_prob=1, continue_train=False, initial_parameters=None):

    costs = []

    # either start fresh or continue training from previously saved parameters
    if not continue_train:
        parameters = initialize_parameters(layers_dims)
    else:
        parameters = initial_parameters

    # use at most one of L2 regularization and dropout
    assert(lambd == 0 or keep_prob == 1)

    for i in range(num_iterations):

        # forward propagation (with dropout if keep_prob < 1)
        if keep_prob == 1:
            AL, caches = model_forward(X, parameters)
        else:
            AL, caches = model_forward_with_dropout(X, parameters, keep_prob)

        # cost (with the L2 penalty if lambd > 0)
        if lambd == 0:
            cost = compute_cost(AL, Y)
        else:
            cost = compute_cost_with_regularization(AL, Y, parameters, lambd)

        # back propagation
        if lambd == 0 and keep_prob == 1:
            grads = model_backward(AL, Y, caches)
        elif lambd != 0:
            grads = model_backward_with_regularization(AL, Y, caches, lambd)
        elif keep_prob < 1:
            grads = model_backward_with_dropout(AL, Y, caches, keep_prob)

        parameters = update_parameters(parameters, grads, learning_rate=learning_rate)

        if print_cost and i % print_size == 0:
            print("Cost after iteration %i: %f" % (i, cost))

        if i % 100 == 0:
            costs.append(cost)

    # plot the learning curve
    plt.plot(np.squeeze(costs))
    plt.ylabel('cost')
    plt.xlabel('iterations (per hundreds)')
    plt.title("Learning rate = " + str(learning_rate))
    plt.show()

    # save the learned parameters to disk
    np.save("param_" + str(len(layers_dims)-1) + "layer_" + str(i), parameters)

    return parameters
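For example, here is how one might call this function. The training and validation arrays and the specific hyperparameter values below (layer sizes, lambd=0.7, keep_prob=0.8) are placeholders, not the settings I actually used:

# hypothetical layer sizes -- substitute your own data and architecture
layers_dims = [train_x.shape[0], 20, 7, 5, 1]

# an L2-regularized run (keep_prob stays at 1, since we use at most one of the two)
parameters_l2 = model(train_x, train_y, layers_dims, learning_rate=0.0075,
                      num_iterations=10000, print_cost=True, lambd=0.7)

# a dropout run (lambd stays at 0)
parameters_dropout = model(train_x, train_y, layers_dims, learning_rate=0.0075,
                           num_iterations=10000, print_cost=True, keep_prob=0.8)

# compare on the validation set and keep whichever generalizes better
print(predict(valid_x, valid_y, parameters_l2))
print(predict(valid_x, valid_y, parameters_dropout))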

Finally, I have a neural network that I feel proud of! Although modern deep learning libraries like TensorFlow and Keras can do these tasks in a few lines of code, building a deep learning model from scratch is definitely a worthwhile experience! The entire code can be obtained from GitHub.