Building a Deep Neural Network from Scratch

Elan Ding

Modified: July 10, 2018 $\newcommand{\bs}{\boldsymbol}$ $\newcommand{\argmin}[1]{\underset{\bs{#1}}{\text{arg min}}\,}$ $\newcommand{\argmax}[1]{\underset{\bs{#1}}{\text{arg max}}\,}$ $\newcommand{\tr}{^{\top}}$ $\newcommand{\norm}[1]{\left|\left|\,#1\,\right|\right|}$ $\newcommand{\given}{\,|\,}$

In this post, I will implement a deep neural network from scratch using only numpy. That's right! Just numpy, my favorite library, and nothing else!

In [1]:
import numpy as np
import matplotlib.pyplot as plt

Forward Propagation

This time I will define the sigmoid and ReLU activation functions slightly differently. Since gradient descent is a long, iterative process, it helps to cache some values so that we do not need to compute them over and over again.

In [2]:
def sigmoid(Z):

    A = 1/(1+np.exp(-Z))
    cache = Z

    return A, cache
In [3]:
def relu(Z):

    A = np.maximum(0,Z)

    assert(A.shape == Z.shape)

    cache = Z
    return A, cache
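As a quick sanity check, here is what the two activations do to a small array (a standalone sketch that inlines the same formulas as the functions above):

```python
import numpy as np

Z = np.array([[-2.0, 0.0, 3.0]])

# sigmoid squashes every entry into (0, 1); sigmoid(0) = 0.5
A_sig = 1 / (1 + np.exp(-Z))

# relu zeroes out negatives and passes positives through unchanged
A_relu = np.maximum(0, Z)

print(A_sig)   # [[0.1192... 0.5 0.9525...]]
print(A_relu)  # [[0. 0. 3.]]
```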

Next we will define a function that initializes the weights $\bs{W}^{[l]}$ by drawing each entry of the matrix from a standard normal distribution, then shrinking it by a factor of $0.01$ to keep the initial weights small. The parameter $\bs{b}^{[l]}$ is set to 0.

In [4]:
def initialize_parameters(layer_dims):

    parameters = {}
    L = len(layer_dims)

    for l in range(1, L):

        parameters['W' + str(l)] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * 0.01
        parameters['b' + str(l)] = np.zeros((layer_dims[l], 1))

        assert(parameters['W' + str(l)].shape == (layer_dims[l], layer_dims[l - 1]))
        assert(parameters['b' + str(l)].shape == (layer_dims[l], 1))

    return parameters
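For example, with layer_dims = [3, 4, 1] (a small standalone sketch that repeats the loop above), the shapes come out as $\bs{W}^{[l]}: (n_l, n_{l-1})$ and $\bs{b}^{[l]}: (n_l, 1)$:

```python
import numpy as np

layer_dims = [3, 4, 1]   # 3 inputs, one hidden layer of 4 units, 1 output
parameters = {}

for l in range(1, len(layer_dims)):
    parameters['W' + str(l)] = np.random.randn(layer_dims[l], layer_dims[l - 1]) * 0.01
    parameters['b' + str(l)] = np.zeros((layer_dims[l], 1))

print(parameters['W1'].shape, parameters['b1'].shape)  # (4, 3) (4, 1)
print(parameters['W2'].shape, parameters['b2'].shape)  # (1, 4) (1, 1)
```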

Each step of forward propagation in a neural network is given by

$$ \begin{aligned} \bs{Z}^{[l]} &= \bs{W}^{[l]} \bs{A}^{[l-1]} + \bs{b}^{[l]} \\ \bs{A}^{[l]} &= g_l(\bs{Z}^{[l]}) \end{aligned} $$

where $g_l$ is the activation function in the $l$-th layer. Here we use the ReLU function for layers $1,\ldots, L-1$ and the sigmoid function for the output layer $L$.

In [5]:
def activation_forward(A_prev, W, b, activation):

    if activation == "sigmoid":

        Z =, A_prev) + b

        assert(Z.shape == (W.shape[0], A_prev.shape[1]))

        linear_cache = (A_prev, W, b)

        A, activation_cache = sigmoid(Z)

    elif activation == "relu":

        Z =, A_prev) + b

        assert(Z.shape == (W.shape[0], A_prev.shape[1]))

        linear_cache = (A_prev, W, b)

        A, activation_cache = relu(Z)

    assert (A.shape == (W.shape[0], A_prev.shape[1]))
    cache = (linear_cache, activation_cache)

    return A, cache

Note that in each step of activation_forward defined above, we get the value of $\bs{A}^{[l]}$ and a cache, which consists of linear_cache ($\bs{A}^{[l-1]}, \bs{W}^{[l]}, \bs{b}^{[l]}$), and activation_cache ($\bs{Z}^{[l]}$).

Next, let's define a new function for the entire forward propagation.

In [6]:
def model_forward(X, parameters):

    caches = []
    A = X
    L = len(parameters) // 2

    for l in range(1, L):
        A_prev = A

        A, cache = activation_forward(A_prev,
                                      parameters['W' + str(l)],
                                      parameters['b' + str(l)],
        caches.append(cache)

    AL, cache = activation_forward(A,
                                   parameters['W' + str(L)],
                                   parameters['b' + str(L)],
    caches.append(cache)

    assert(AL.shape == (1, X.shape[1]))

    return AL, caches

From the model_forward function, we get the final prediction AL, which is $\widehat{\bs{Y}}$, as well as a list of caches consisting of all the (linear_cache, activation_cache) tuples produced by the activation_forward function defined earlier.

Finally we have reached the end of forward propagation! We can now compute the cost function as the negative log-likelihood.

In [7]:
def compute_cost(AL, Y):

    m = Y.shape[1]

    cost = (-1 / m) * (, np.log(AL).T) + - Y, np.log(1 - AL).T))

    cost = np.squeeze(cost)
    assert(cost.shape == ())

    return cost

Backward Propagation

Next we move on to backward propagation. Each step of backward propagation can be summarized by

$$ \begin{aligned} d\bs{Z}^{[l]} &= d\bs{A}^{[l]} * \frac{d\bs{A}^{[l]}}{d\bs{Z}^{[l]}} \\ d\bs{A}^{[l-1]} &= {\bs{W}^{[l]}}\tr d\bs{Z}^{[l]} \end{aligned} \tag{1} $$

Check my previous post for the derivation. The derivative $d\bs{A}^{[l]}/d\bs{Z}^{[l]}= g'_l(\bs{Z}^{[l]})$ is simply the derivative of the activation function in the $l$-th layer. The first equation in (1) is implemented in the relu_backward and sigmoid_backward functions below. These functions take as inputs the matrix $d\bs{A}^{[l]}$, which is obtained from the previous iteration, and a cache variable, which is simply the value of $\bs{Z}^{[l]}$ saved during forward propagation.

In [8]:
def relu_backward(dA, cache):

    Z = cache
    dZ = np.array(dA, copy=True)

    dZ[Z <= 0] = 0

    assert (dZ.shape == Z.shape)

    return dZ
In [9]:
def sigmoid_backward(dA, cache):

    Z = cache

    s = 1/(1+np.exp(-Z))
    dZ = dA * s * (1-s)

    assert (dZ.shape == Z.shape)

    return dZ
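A quick way to gain confidence in these formulas is a finite-difference check. Here we compare the analytic derivative used by sigmoid_backward against a numerical slope (a standalone sketch with the formulas inlined):

```python
import numpy as np

Z = np.array([[0.5, -1.2, 3.0]])
dA = np.ones_like(Z)   # upstream gradient of 1, so dZ equals g'(Z)
eps = 1e-6

s = 1 / (1 + np.exp(-Z))
dZ_analytic = dA * s * (1 - s)   # same formula as sigmoid_backward

# central-difference approximation of the sigmoid's slope
s_plus  = 1 / (1 + np.exp(-(Z + eps)))
s_minus = 1 / (1 + np.exp(-(Z - eps)))
dZ_numeric = (s_plus - s_minus) / (2 * eps)

print(np.max(np.abs(dZ_analytic - dZ_numeric)))  # tiny, ~1e-11
```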

Similar to activation_forward, we define the same process for backpropagation. The function activation_backward takes in $d\bs{A}^{[l]}$, a cache equal to the tuple (linear_cache, activation_cache), and the type of activation, which can be either sigmoid or relu. In addition to equations (1) above, we also calculate the derivatives of the weight and bias parameters:

$$ \begin{aligned} d\bs{W}^{[l]} &= \frac{1}{m} d\bs{Z}^{[l]} {\bs{A}^{[l-1]}}\tr \\ d\bs{b}^{[l]} &= \frac{1}{m} d\bs{Z}^{[l]} \bs{1} \end{aligned}\tag{2} $$

Again refer to my previous post for the derivation.

In [10]:
def activation_backward(dA, cache, activation):

    linear_cache, activation_cache = cache

    if activation == "relu":

        dZ = relu_backward(dA, activation_cache)

    elif activation == "sigmoid":

        dZ = sigmoid_backward(dA, activation_cache)

    A_prev, W, b = linear_cache
    m = A_prev.shape[1]

    dW = (1. / m) *, A_prev.T)
    db = (1. / m) * np.sum(dZ, axis=1, keepdims=True)
    dA_prev =, dZ)

    assert (dA_prev.shape == A_prev.shape)
    assert (dW.shape == W.shape)
    assert (db.shape == b.shape)

    return dA_prev, dW, db

Now we are ready to define the entire backpropagation process in a single function. It takes as inputs the predicted values AL, or $\widehat{\bs{Y}}$, the true labels $\bs{Y}$, and the list of cached values from model_forward, and outputs a dictionary of the gradients used for gradient descent.

In [11]:
def model_backward(AL, Y, caches):

    grads = {}
    L = len(caches)
    m = AL.shape[1]
    Y = Y.reshape(AL.shape)

    dAL = - (np.divide(Y, AL) - np.divide(1 - Y, 1 - AL))

    current_cache = caches[-1]
    grads["dA" + str(L)], grads["dW" + str(L)], grads["db" + str(L)] = activation_backward(dAL, current_cache, activation="sigmoid")

    for l in reversed(range(L-1)):

        current_cache = caches[l]

        dA_prev_temp, dW_temp, db_temp = activation_backward(grads["dA" + str(l + 2)], current_cache, activation="relu")
        grads["dA" + str(l + 1)] = dA_prev_temp
        grads["dW" + str(l + 1)] = dW_temp
        grads["db" + str(l + 1)] = db_temp

    return grads

We define a function that updates the parameters in a single iteration of gradient descent.

In [12]:
def update_parameters(parameters, grads, learning_rate):

    L = len(parameters) // 2

    for l in range(L):
        parameters["W" + str(l + 1)] = parameters["W" + str(l + 1)] - learning_rate * grads["dW" + str(l + 1)]
        parameters["b" + str(l + 1)] = parameters["b" + str(l + 1)] - learning_rate * grads["db" + str(l + 1)]

    return parameters
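For instance, a single step with learning rate $0.1$ on hypothetical one-entry parameters moves each value against its gradient (the same update rule, written out standalone):

```python
import numpy as np

parameters = {'W1': np.array([[0.5]]), 'b1': np.array([[0.1]])}
grads      = {'dW1': np.array([[2.0]]), 'db1': np.array([[1.0]])}
learning_rate = 0.1

# W <- W - alpha * dW,  b <- b - alpha * db
for l in range(1, 2):
    parameters['W' + str(l)] = parameters['W' + str(l)] - learning_rate * grads['dW' + str(l)]
    parameters['b' + str(l)] = parameters['b' + str(l)] - learning_rate * grads['db' + str(l)]

print(parameters['W1'], parameters['b1'])  # [[0.3]] [[0.]]
```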

Here is the most exciting part. We have finally put together a deep neural network!

In [13]:
def deep_model(X, Y, layers_dims, learning_rate = 0.0075, num_iterations = 3000, print_cost=False):

    costs = []

    parameters = initialize_parameters(layers_dims)

    for i in range(0, num_iterations):

        AL, caches = model_forward(X, parameters)

        cost = compute_cost(AL, Y)

        grads = model_backward(AL, Y, caches)

        parameters = update_parameters(parameters, grads, learning_rate=learning_rate)

        if print_cost and i % 100 == 0:
            print("Cost after iteration %i: %f" % (i, cost))
        if i % 100 == 0:
            costs.append(cost)

    plt.plot(np.squeeze(costs))
    plt.ylabel('cost')
    plt.xlabel('iterations (per hundreds)')
    plt.title("Learning rate = " + str(learning_rate))

    return parameters

In the next post, we will test this network on a chest X-ray image dataset. I can't wait!