# Building a Deep Neural Network from Scratch¶

### Elan Ding¶

Modified: July 10, 2018 $\newcommand{\bs}{\boldsymbol}$ $\newcommand{\argmin}{\underset{\bs{#1}}{\text{arg min}}\,}$ $\newcommand{\argmax}{\underset{\bs{#1}}{\text{arg max}}\,}$ $\newcommand{\tr}{^{\top}}$ $\newcommand{\norm}{\left|\left|\,#1\,\right|\right|}$ $\newcommand{\given}{\,|\,}$

In this post, I will implement a deep neural network from scratch using only numpy. That's right! Just numpy, my favorite library, and nothing else!

In :
import numpy as np


## Forward Propagation¶

This time I will define the sigmoid and ReLU activation functions slightly differently. Since gradient descent is a long and iterative process, it helps to cache some value so that we do not need to compute it over and over again.

In :
def sigmoid(Z):

A = 1/(1+np.exp(-Z))
cache = Z

return A, cache

In :
def relu(Z):

A = np.maximum(0,Z)

assert(A.shape == Z.shape)

cache = Z
return A, cache


Next we will define a function that initializes the weights $\bs{W}^{[l]}$ by randomly assigning each entry in the matrix a number between -1 and 1, then we shrink its size by multiplying it by $0.01$. The parameter $\bs{b}^{[l]}$ is set to 0.

In :
def initialize_parameters(layer_dims):

np.random.seed(3)
parameters = {}
L = len(layer_dims)

for l in range(1, L):

parameters['W' + str(l)] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * 0.01
parameters['b' + str(l)] = np.zeros((layer_dims[l], 1))

assert(parameters['W' + str(l)].shape == (layer_dims[l], layer_dims[l - 1]))
assert(parameters['b' + str(l)].shape == (layer_dims[l], 1))

return parameters


Each step of forward propagation in a neural network is given by

\begin{aligned} \bs{Z}^{[l]} &= \bs{W}^{[l]} \bs{A}^{[l-1]} + \bs{b}^{[l]} \\ \bs{A}^{[l]} &= g_l(\bs{Z}^{[l]}) \end{aligned}

where $g_l$ is the activation in the $l$-th layer. For this exploration, for layers $1,..., L-1$, we use the ReLU function. For the output layer $L$, we use the sigmoid function.

In :
def activation_forward(A_prev, W, b, activation):

if activation == "sigmoid":

Z = np.dot(W, A_prev) + b

assert(Z.shape == (W.shape, A_prev.shape))

linear_cache = (A_prev, W, b)

A, activation_cache = sigmoid(Z)

elif activation == "relu":

Z = np.dot(W, A_prev) + b

assert(Z.shape == (W.shape, A_prev.shape))

linear_cache = (A_prev, W, b)

A, activation_cache = relu(Z)

assert (A.shape == (W.shape, A_prev.shape))
cache = (linear_cache, activation_cache)

return A, cache


Note that in each step of activation_forward defined above, we get the value of $\bs{A}^{[l]}$ and a cache, which consists of linear_cache ($\bs{A}^{[l-1]}, \bs{W}^{[l]}, \bs{b}^{[l]}$), and activation_cache ($\bs{Z}^{[l]}$).

Next, let's define a new function for the entire forward propagation.

In :
def model_forward(X, parameters):

caches = []
A = X
L = len(parameters) // 2

for l in range(1, L):
A_prev = A

A, cache = activation_forward(A_prev,
parameters['W' + str(l)],
parameters['b' + str(l)],
activation='relu')
caches.append(cache)

AL, cache = activation_forward(A,
parameters['W' + str(L)],
parameters['b' + str(L)],
activation='sigmoid')
caches.append(cache)

assert(AL.shape == (1, X.shape))

return AL, caches


From the model_forward function, we get the final prediction values AL, which is $\widehat{\bs{Y}}$, as well as a list of caches which consists of all the values of the tuple (linear_cache, activation_cache) from the activation_forward function defined earlier.

Finally we have reached the end of forward activation! We can now compute the loss function by finding the negative log-likelihood.

In :
def compute_cost(AL, Y):

m = Y.shape

cost = (-1 / m) * (np.dot(Y, np.log(AL).T) + np.dot(1 - Y, np.log(1 - AL).T))

cost = np.squeeze(cost)
assert(cost.shape == ())

return cost


## Backward Propagation¶

Next we move on to backward propagation. Each step of backward propagation can be summarized by

\begin{aligned} d\bs{Z}^{[l]} &= d\bs{A}^{[l]} * \frac{d\bs{A}^{[l]}}{d\bs{Z}^{[l]}} \\ d\bs{A}^{[l-1]} &= {\bs{W}^{[l]}}\tr d\bs{Z}^{[l]} \end{aligned} \tag{1}

Check my previous post for the derivation. The derivative $d\bs{A}^{[l]}/d\bs{Z}^{[l]}= g'_l(\bs{Z}^{[l]})$ is simply the derivative of the activation function in the $l$-th layer. The first equation in (1) is defined in the relu_backward and sigmoid_backward functions below. These functions take as inputs the matrix $d\bs{A}^{[l]}$, which can be obtained from previous iteration, and a cache variable, which is simply the value of $\bs{Z}^{[l]}$ cached from forward propagation.

In :
def relu_backward(dA, cache):

Z = cache
dZ = np.array(dA, copy=True)

dZ[Z <= 0] = 0

assert (dZ.shape == Z.shape)

return dZ

In :
def sigmoid_backward(dA, cache):

Z = cache

s = 1/(1+np.exp(-Z))
dZ = dA * s * (1-s)

assert (dZ.shape == Z.shape)

return dZ


Similar to activation_foward, we are going to define the same process for back propagation. The function activation_backward takes in values $d\bs{A}^{[l]}$, and a cache value equal to the tuple (linear_cache, activation_cache), and finally the type of activation, which can be either sigmoid or ReLU. In addition to equations (1) above, we further calculate the derivative of the weight and bias parameters:

\begin{aligned} d\bs{W}^{[l]} &= d\bs{Z}^{[l]} {\bs{A}^{[l-1]}}\tr \\ d\bs{b}^{[l]} &= d\bs{Z}^{[l]} \bs{1} \end{aligned}\tag{2}

Again refer to my previous post for the derivation.

In :
def activation_backward(dA, cache, activation):

linear_cache, activation_cache = cache

if activation == "relu":

dZ = relu_backward(dA, activation_cache)

elif activation == "sigmoid":

dZ = sigmoid_backward(dA, activation_cache)

A_prev, W, b = linear_cache
m = A_prev.shape

dW = (1. / m) * np.dot(dZ, A_prev.T)
db = (1. / m) * np.sum(dZ, axis=1, keepdims=True)
dA_prev = np.dot(W.T, dZ)

assert (dA_prev.shape == A_prev.shape)
assert (dW.shape == W.shape)
assert (db.shape == b.shape)

return dA_prev, dW, db


Now we are ready to define the entire back propagation process in a single function. The function takes as inputs the predicted value AL, or $\widehat{\bs{Y}}$, the true value $\bs{Y}$, and the list of cached values from model_forward process, and output a dictionary of the gradients used for gradient descent.

In :
def model_backward(AL, Y, caches):

L = len(caches)
m = AL.shape
Y = Y.reshape(AL.shape)

dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))

current_cache = caches[-1]

for l in reversed(range(L-1)):

current_cache = caches[l]

dA_prev_temp, dW_temp, db_temp = activation_backward(grads["dA" + str(l + 2)], current_cache, activation="relu")
grads["dA" + str(l + 1)] = dA_prev_temp
grads["dW" + str(l + 1)] = dW_temp
grads["db" + str(l + 1)] = db_temp



We define a function that updates the parameters in a single iteration of gradient descent.

In :
def update_parameters(parameters, grads, learning_rate):

L = len(parameters) // 2

for l in range(L):
parameters["W" + str(l + 1)] = parameters["W" + str(l + 1)] - learning_rate * grads["dW" + str(l + 1)]
parameters["b" + str(l + 1)] = parameters["b" + str(l + 1)] - learning_rate * grads["db" + str(l + 1)]

return parameters


Here is the most exciting part. We have finally put together a deep neural network!

In :
def deep_model(X, Y, layers_dims, learning_rate = 0.0075, num_iterations = 3000, print_cost=False):

costs = []

parameters = initialize_parameters(layers_dims)

for i in range(0, num_iterations):

AL, caches = model_forward(X, parameters)

cost = compute_cost(AL, Y)

if print_cost and i % 100 == 0:
print ("Cost after iteration %i: %f" %(i, cost))
if print_cost and i % 100 == 0:
costs.append(cost)

plt.plot(np.squeeze(costs))
plt.ylabel('cost')
plt.xlabel('iterations (per tens)')
plt.title("Learning rate =" + str(learning_rate))
plt.show()

return parameters


In the next post, we will be testing that on the x-ray chest image data. I can't wait!