*Modified*: July 14, 2018
$\newcommand{\bs}{\boldsymbol}$
$\newcommand{\argmin}[1]{\underset{\bs{#1}}{\text{arg min}}\,}$
$\newcommand{\argmax}[1]{\underset{\bs{#1}}{\text{arg max}}\,}$
$\newcommand{\tr}{^{\top}}$
$\newcommand{\norm}[1]{\left|\left|\,#1\,\right|\right|}$
$\newcommand{\given}{\,|\,}$

Training a deep neural network is a highly empirical process. The first successful deep model I implemented had 5 layers, a learning rate of 0.001, and ran 12,000 iterations of gradient descent. The result was slightly better than my previous model's, but not by much. You may ask: where did I come up with those numbers? The answer is simply trial and error, with some intuition. When the hyperparameters are chosen appropriately, the training process looks quite simple, but in reality much time was spent tuning the model. On the bright side, it is fun doing other tasks, such as watching Netflix or drinking green tea, while periodically checking my neural network's performance. In this post I will show a few methods to improve a neural network's performance.

Sometimes, if you run gradient descent for too long, the model gains too much flexibility, leading to high variance, and test set performance becomes poor. **Early stopping** controls this kind of overfitting by halting training before it sets in. To implement it, we only need a small modification to our model.

In [1]:

```
def model_early_stopping(X, Y, valid_x, valid_y, layers_dims, learning_rate=0.0075,
                         num_iterations=10000, print_cost=False, print_size=100):
    costs = []
    accuracy_prev = 0
    parameters = initialize_parameters_he(layers_dims)
    for i in range(num_iterations):
        AL, caches = model_forward(X, parameters)
        cost = compute_cost(AL, Y)
        grads = model_backward(AL, Y, caches)
        cache = parameters  # checkpoint before the update so we can revert
        parameters = update_parameters(parameters, grads, learning_rate=learning_rate)
        if print_cost and i % print_size == 0:
            print("Cost after iteration %i: %f" % (i, cost))
        if i % 100 == 0:  # check validation accuracy every 100 iterations
            accuracy = predict(valid_x, valid_y, parameters)
            costs.append(cost)
            if accuracy < accuracy_prev:
                parameters = cache  # revert to the previous checkpoint
                print("Early stopping at iteration {} to prevent overfitting".format(i))
                break
            else:
                accuracy_prev = accuracy
    return parameters
```

What happens in this code is that we continuously keep track of the model's accuracy on the `validation set`. Initially, accuracy should improve until the model reaches the point of overfitting; at that stage, we stop training. This was the first technique I used, and my five-layer model stopped at iteration 12,000. It gave a training set accuracy of 98% and a test set accuracy of about 75%, which is slightly better than my older model. However, early stopping did not solve the overfitting problem, and it may not be the best technique here: by stopping early, the model loses some of its ability to learn the training set.
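In isolation, the checkpoint-and-revert logic can be sketched like this (a toy helper operating on a recorded sequence of validation accuracies; the function name and numbers are hypothetical, not part of the model code above):

```python
# Walk through validation accuracies measured at successive checkpoints and
# stop as soon as accuracy drops, returning the index of the previous
# (best-so-far) checkpoint to revert to.
def early_stop_index(val_accuracies):
    best = -1.0
    for i, acc in enumerate(val_accuracies):
        if acc < best:
            return i - 1  # revert to the previous checkpoint
        best = acc
    return len(val_accuracies) - 1  # never overfit: keep the last checkpoint

# Accuracy improves, then starts to fall at index 4, so we keep checkpoint 3.
print(early_stop_index([0.60, 0.70, 0.74, 0.76, 0.73, 0.71]))  # -> 3
```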

Another popular technique to reduce overfitting is called **L2 regularization**. The idea is simple: we add to the cost function an extra term that places a penalty on weights that are too large. That is,

$$ J_{\text{regularized}} = -\frac{1}{m}\left[\log(\bs{A}^{[L]})\bs{Y}\tr + \log(1-\bs{A}^{[L]})(1-\bs{Y})\tr \right] + \frac{\lambda}{2m} \sum_{l=1}^{L}\norm{\bs{W}^{[l]}}^2_F $$

To implement this, note that forward propagation is unaffected. All we need to do is to modify the cost function and back propagation. First let's modify the cost function.

In [2]:

```
def compute_cost_with_regularization(AL, Y, parameters, lambd):
    m = Y.shape[1]
    cross_entropy_cost = compute_cost(AL, Y)
    L2_regularization_cost = 0
    for l in range(len(parameters) // 2):  # parameters holds both W and b per layer
        L2_regularization_cost += np.sum(np.square(parameters["W" + str(l + 1)]))
    L2_regularization_cost = L2_regularization_cost * (1. / m) * (lambd / 2)
    cost = cross_entropy_cost + L2_regularization_cost
    return cost
```

Here the variable `lambd` refers to $\lambda$, the penalty hyperparameter (`lambda` is a reserved word in Python, hence the spelling). The higher $\lambda$ is, the more regularization penalty we place on the model.
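As a quick illustration (with a made-up helper and toy weights, not part of the model code), here is how the penalty term $\frac{\lambda}{2m} \sum_{l}\norm{\bs{W}^{[l]}}^2_F$ scales with $\lambda$:

```python
import numpy as np

# Hypothetical toy parameters: two layers, sum of squared weights = 1 + 4 + 9 = 14.
parameters = {"W1": np.array([[1.0, -2.0]]), "b1": np.zeros((1, 1)),
              "W2": np.array([[3.0]]),       "b2": np.zeros((1, 1))}

def l2_penalty(parameters, lambd, m):
    # (lambda / 2m) * sum over layers of the squared Frobenius norm of W
    total = sum(np.sum(np.square(parameters["W" + str(l + 1)]))
                for l in range(len(parameters) // 2))
    return (lambd / (2.0 * m)) * total

# The penalty grows linearly with lambda: 0.1 * 14 / 20 vs 1.0 * 14 / 20.
print(l2_penalty(parameters, lambd=0.1, m=10))  # -> 0.07
print(l2_penalty(parameters, lambd=1.0, m=10))  # -> 0.7
```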

Another fun fact from linear algebra is that we can rewrite the Frobenius norm in terms of the trace, so the derivative can be easily found as

$$ \frac{d}{d\bs{W}} \frac{\lambda}{2m} \norm{\bs{W}}^2_F = \frac{d}{d\bs{W}} \frac{\lambda}{2m} \text{Tr}(\bs{W}\bs{W}\tr) = \frac{\lambda}{m} \bs{W} $$
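We can sanity-check this derivative numerically with a central finite difference (a standalone snippet with made-up values, not from the model code):

```python
import numpy as np

# Verify d/dW [ (lambda / 2m) ||W||_F^2 ] = (lambda / m) W on a random matrix.
np.random.seed(0)
lambd, m = 0.7, 5
W = np.random.randn(3, 4)

analytic = (lambd / m) * W

# Central finite difference, entry by entry.
f = lambda M: (lambd / (2.0 * m)) * np.sum(M ** 2)
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        numeric[i, j] = (f(Wp) - f(Wm)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # tiny (roundoff-level) difference
```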

Hence, we can update the back propagation by making the following changes.

In [3]:

```
def activation_backward_with_regularization(dA, cache, activation, lambd):
    linear_cache, activation_cache = cache
    if activation == "relu":
        dZ = relu_backward(dA, activation_cache)
    elif activation == "sigmoid":
        dZ = sigmoid_backward(dA, activation_cache)
    A_prev, W, b = linear_cache
    m = A_prev.shape[1]
    # the extra (lambd / m) * W term comes from the regularization penalty
    dW = (1. / m) * (np.dot(dZ, A_prev.T) + lambd * W)
    db = (1. / m) * np.sum(dZ, axis=1, keepdims=True)
    dA_prev = np.dot(W.T, dZ)
    assert (dA_prev.shape == A_prev.shape)
    assert (dW.shape == W.shape)
    assert (db.shape == b.shape)
    return dA_prev, dW, db
```

In [4]:

```
def model_backward_with_regularization(AL, Y, caches, lambd):
    grads = {}
    L = len(caches)  # number of layers
    m = AL.shape[1]
    Y = Y.reshape(AL.shape)
    # derivative of the cross-entropy cost with respect to AL
    dAL = - (nan_divide(Y, AL) - nan_divide(1 - Y, 1 - AL))
    # output layer uses the sigmoid activation
    current_cache = caches[-1]
    grads["dA" + str(L)], grads["dW" + str(L)], grads["db" + str(L)] = \
        activation_backward_with_regularization(dAL, current_cache,
                                                activation="sigmoid", lambd=lambd)
    # hidden layers use the relu activation
    for l in reversed(range(L - 1)):
        current_cache = caches[l]
        dA_prev_temp, dW_temp, db_temp = activation_backward_with_regularization(
            grads["dA" + str(l + 2)], current_cache, activation="relu", lambd=lambd)
        grads["dA" + str(l + 1)] = dA_prev_temp
        grads["dW" + str(l + 1)] = dW_temp
        grads["db" + str(l + 1)] = db_temp
    return grads
```

**Dropout** is a popular regularization method that is very effective for image classification. The basic idea is that at each activation step, we randomly "knock out" some neurons by assigning them a value of 0. Mathematically, we multiply (elementwise) each activation layer $\bs{A}^{[l]}$ by a random matrix $\bs{D}^{[l]}$ of the same dimension, where each entry of $\bs{D}^{[l]}$ is 1 with probability `keep_prob` and 0 with probability `1 - keep_prob`. The following are the modifications.

In [5]:

```
def activation_forward_with_dropout(A_prev, W, b, activation, keep_prob):
    Z = np.dot(W, A_prev) + b
    assert (Z.shape == (W.shape[0], A_prev.shape[1]))
    linear_cache = (A_prev, W, b)
    if activation == "sigmoid":
        # no dropout on the output layer
        A, activation_cache = sigmoid(Z)
        cache = (linear_cache, activation_cache)
    elif activation == "relu":
        A, activation_cache = relu(Z)
        D = np.random.rand(A.shape[0], A.shape[1])
        D = D < keep_prob          # mask: 1 with probability keep_prob
        A = np.multiply(A, D)      # knock out neurons
        A /= keep_prob             # inverted dropout: keep the expected value unchanged
        cache = (linear_cache, activation_cache, D)
    assert (A.shape == (W.shape[0], A_prev.shape[1]))
    return A, cache
```
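As a quick sanity check (standalone, with made-up values), the `A /= keep_prob` rescaling above, known as inverted dropout, keeps the expected activation unchanged: averaging `A * D / keep_prob` over many random masks recovers `A`.

```python
import numpy as np

np.random.seed(1)
keep_prob = 0.5
A = np.ones((4, 3)) * 2.0  # toy activations

# Average the dropped-and-rescaled activations over many random masks.
trials = 20000
acc = np.zeros_like(A)
for _ in range(trials):
    D = np.random.rand(*A.shape) < keep_prob
    acc += np.multiply(A, D) / keep_prob
mean_A = acc / trials

print(np.max(np.abs(mean_A - A)))  # close to 0
```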

In [6]:

```
def model_forward_with_dropout(X, parameters, keep_prob=0.5):
    caches = []
    A = X
    L = len(parameters) // 2  # number of layers
    # hidden layers: relu with dropout
    for l in range(1, L):
        A_prev = A
        A, cache = activation_forward_with_dropout(A_prev,
                                                   parameters['W' + str(l)],
                                                   parameters['b' + str(l)],
                                                   activation='relu',
                                                   keep_prob=keep_prob)
        caches.append(cache)
    # output layer: sigmoid, no dropout applied
    AL, cache = activation_forward_with_dropout(A,
                                                parameters['W' + str(L)],
                                                parameters['b' + str(L)],
                                                activation='sigmoid',
                                                keep_prob=keep_prob)
    caches.append(cache)
    assert (AL.shape == (1, X.shape[1]))
    return AL, caches
```

In [7]:

```
def activation_backward_with_dropout(dA, cache, activation, keep_prob):
    linear_cache, activation_cache, D = cache
    A_prev, W, b = linear_cache
    if activation == "relu":
        dA = np.multiply(dA, D)  # apply the same mask used in forward propagation
        dA /= keep_prob
        dZ = relu_backward(dA, activation_cache)
    elif activation == "sigmoid":
        dZ = sigmoid_backward(dA, activation_cache)
    m = A_prev.shape[1]
    dW = (1. / m) * np.dot(dZ, A_prev.T)
    db = (1. / m) * np.sum(dZ, axis=1, keepdims=True)
    dA_prev = np.dot(W.T, dZ)
    assert (dA_prev.shape == A_prev.shape)
    assert (dW.shape == W.shape)
    assert (db.shape == b.shape)
    return dA_prev, dW, db
```

In [8]:

```
def model_backward_with_dropout(AL, Y, caches, keep_prob):
    grads = {}
    L = len(caches)  # number of layers
    m = AL.shape[1]
    Y = Y.reshape(AL.shape)
    dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))
    # output layer: sigmoid, no dropout mask to undo
    current_cache = caches[-1]
    grads["dA" + str(L)], grads["dW" + str(L)], grads["db" + str(L)] = \
        activation_backward(dAL, current_cache, activation="sigmoid")
    # hidden layers: relu with dropout
    for l in reversed(range(L - 1)):
        current_cache = caches[l]
        dA_prev_temp, dW_temp, db_temp = activation_backward_with_dropout(
            grads["dA" + str(l + 2)], current_cache, activation="relu", keep_prob=keep_prob)
        grads["dA" + str(l + 1)] = dA_prev_temp
        grads["dW" + str(l + 1)] = dW_temp
        grads["db" + str(l + 1)] = db_temp
    return grads
```

At this point everything looked good, until I ran the algorithm on a 3-layer neural network. After about 3,000 iterations, I got a `divide by zero` error message. It turns out that with dropout, if a neuron is knocked out, the `np.divide` calls in back propagation can divide by 0. Also, in the original `compute_cost` function, the log-likelihood becomes undefined at 0. Here is an easy fix. First, we redefine the cost function.

In [9]:

```
def compute_cost(AL, Y):
    m = Y.shape[1]
    logprobs = np.multiply(-np.log(AL), Y) + np.multiply(-np.log(1 - AL), 1 - Y)
    cost = 1. / m * np.nansum(logprobs)  # NaN terms (e.g. 0 * log 0) are treated as 0
    assert (cost.shape == ())
    return cost
```

The cool thing about `np.nansum` is that it treats all NaN values as 0 when summing. Furthermore, we define a new division function.
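For instance:

```python
import numpy as np

# np.nansum skips NaN entries, so undefined terms simply drop out of the cost.
logprobs = np.array([1.0, np.nan, 2.0])
print(np.nansum(logprobs))  # -> 3.0
print(np.sum(logprobs))     # -> nan, which is why plain np.sum breaks the cost
```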

In [10]:

```
def nan_divide(a, b):
    # divide elementwise, mapping inf (x / 0) and NaN (0 / 0) results to 0
    with np.errstate(divide='ignore', invalid='ignore'):
        c = np.true_divide(a, b)
        c[c == np.inf] = 0
        c = np.nan_to_num(c)
    return c
```

Then we replace `np.divide` with `nan_divide`, and everything should work!
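A quick check of the behavior (repeating the definition above so the snippet is self-contained):

```python
import numpy as np

def nan_divide(a, b):
    # divide elementwise, mapping inf (x / 0) and NaN (0 / 0) results to 0
    with np.errstate(divide='ignore', invalid='ignore'):
        c = np.true_divide(a, b)
        c[c == np.inf] = 0
        c = np.nan_to_num(c)
    return c

a = np.array([1.0, 0.0, 6.0])
b = np.array([0.0, 0.0, 3.0])
# 1/0 would be inf and 0/0 would be NaN under np.divide; both map to 0 here,
# so knocked-out neurons no longer crash back propagation.
print(nan_divide(a, b))  # -> [0. 0. 2.]
```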

Finally, we put everything into a model:

In [11]:

```
def model(X, Y, layers_dims, learning_rate=0.0075, num_iterations=10000,
          print_cost=False, print_size=100, lambd=0, keep_prob=1,
          continue_train=False, initial_parameters=0):
    costs = []
    if continue_train == False:
        parameters = initialize_parameters(layers_dims)
    else:
        parameters = initial_parameters
    # use at most one of L2 regularization and dropout
    assert (lambd == 0 or keep_prob == 1)
    for i in range(num_iterations):
        if keep_prob == 1:
            AL, caches = model_forward(X, parameters)
        else:
            AL, caches = model_forward_with_dropout(X, parameters, keep_prob)
        if lambd == 0:
            cost = compute_cost(AL, Y)
        else:
            cost = compute_cost_with_regularization(AL, Y, parameters, lambd)
        if lambd == 0 and keep_prob == 1:
            grads = model_backward(AL, Y, caches)
        elif lambd != 0:
            grads = model_backward_with_regularization(AL, Y, caches, lambd)
        elif keep_prob < 1:
            grads = model_backward_with_dropout(AL, Y, caches, keep_prob)
        parameters = update_parameters(parameters, grads, learning_rate=learning_rate)
        if print_cost and i % print_size == 0:
            print("Cost after iteration %i: %f" % (i, cost))
        if print_cost and i % 100 == 0:
            costs.append(cost)
    plt.plot(np.squeeze(costs))
    plt.ylabel('cost')
    plt.xlabel('iterations (per hundreds)')
    plt.title("Learning rate = " + str(learning_rate))
    plt.show()
    np.save("param_" + str(len(layers_dims) - 1) + "layer_" + str(i), parameters)
    return parameters
```

Finally, I have a neural network that I feel proud of! Although modern deep learning libraries like `Tensorflow` and `Keras` can do these tasks in a few lines of code, building a deep learning model from scratch is definitely a worthwhile experience! The entire code can be obtained from GitHub.