# Back Propagation in a Neural Network¶

### Elan Ding¶

Modified: July 6, 2018 $\newcommand{\bs}{\boldsymbol}$ $\newcommand{\argmin}[1]{\underset{\bs{#1}}{\text{arg min}}\,}$ $\newcommand{\argmax}[1]{\underset{\bs{#1}}{\text{arg max}}\,}$ $\newcommand{\tr}{^{\top}}$ $\newcommand{\norm}[1]{\left|\left|\,#1\,\right|\right|}$ $\newcommand{\given}{\,|\,}$

Suppose we have a simple 2-layer neural network shown below. I drew this using a software called inkscape very quickly. It is free and easy to learn compared to tikz. It supports math typing using the tex text extension.

And for the first hidden layer, we define the activation function to be $g_1$ and in the output layer, the activation function $g_2$ is the sigmoid function. Now let's derive the gradient descent algorithm for some fun!

## Forward propagation¶

First let $\bs{x}$ denote the column vector of the input layer. In the picture above, $\bs{x} = [x_1, x_2, x_3]\tr$. Let $g_1$ be the activation function for the hidden layer, and let $g_2$ be the activation function for the output layer.

Let $n^{[0]}, n^{[1]}, n^{[2]}$ be the number of nodes in the input layer, hidden layer, and output layer, respectively. Here, $n^{[0]}=3, n^{[1]}=4$ and $n^{[2]}=1$.

For layers $i=1$ and $i=2$, and each node $j=1,...,n^{[i]}$, let $\bs{w}^{[i]}_j \in \mathbb{R}^{n^{[i-1]}\times 1}$ be the weight vectors and $b_j^{[i]}$ the bias parameters.

Let $z_i^{[1]} = {\bs{w}_i^{[1]}}\tr \bs{x} + b_i^{[1]}$ for $1\leq i \leq n^{[1]}$ in the hidden layer. Likewise, we define $z_i^{[2]} = {\bs{w}_i^{[2]}}\tr \bs{x} + b_i^{[2]}$ for $1\leq i \leq n^{[2]}$ in the output layer. Let $a_i^{[1]} = g_i(z_i^{[1]})$ for $1\leq i \leq n^{[1]}$ and $a_i^{[2]} = g_i(z_i^{[2]})$ for $1\leq i \leq n^{[2]}$.

This is a lot of parameters. We would like to vectorize them using matrices. So let's define

$$\bs{W}^{[1]} = \begin{bmatrix} \cdot & {\bs{w}_1^{[1]}}\tr & \cdot \\ \cdot & {\bs{w}_2^{[1]}}\tr & \cdot \\ \cdot & {\bs{w}_3^{[1]}}\tr & \cdot \\ \cdot & {\bs{w}_4^{[1]}}\tr & \cdot \end{bmatrix}, \,\, \bs{b}^{[1]} = \begin{bmatrix} b_1^{[1]} \\ b_2^{[1]} \\ b_3^{[1]} \\ b_4^{[1]} \end{bmatrix}, \,\, \bs{z}^{[1]} = \begin{bmatrix} z_1^{[1]} \\ z_2^{[1]} \\ z_3^{[1]} \\ z_4^{[1]} \end{bmatrix}, \,\, \bs{a}^{[1]} = \begin{bmatrix} a_1^{[1]} \\ a_2^{[1]} \\ a_3^{[1]} \\ a_4^{[1]} \end{bmatrix}.$$

and similarly for $\bs{W}^{[2]}, \bs{b}^{[2]}, \bs{z}^{[2]}$, and $\bs{a}^{[2]}$.

For a single training example $(\bs{x}, y)$, the forward propagation step can be succinctly expressed as

\begin{aligned} \bs{z}^{[1]} &= \bs{W}^{[1]}\bs{x} + \bs{b}^{[1]} \\ \bs{a}^{[1]} &= g_1(\bs{z}^{[1]}) \\ \bs{z}^{[2]} &= \bs{W}^{[2]} \bs{a}^{[1]} + \bs{b}^{[2]} \\ \bs{a}^{[2]} &= g_2(\bs{z}^{[2]}) \end{aligned}

So far, the procedure is only for a single training example. For $m$ training examples $(\bs{x}^{(1)}, y^{(1)}), ..., (\bs{x}^{(m)}, y^{(m)})$, we introduce the variables $\bs{X}, \bs{Y},\bs{Z}^{[1]}, \bs{Z}^{[2]}, \bs{A}^{[1]}, \bs{A}^{[2]}$ as the extension of their lower-case counterparts. The method of extension is to stack the lower-case vectors as columns for each training example. For instance $\bs{Y} = [y_1,..., y_m]$, $\bs{X} = [\bs{x}^{(1)},..., \bs{x}^{(m)}]$, $\bs{Z}^{[1]} = [\bs{z}^{[1](1)}, ..., \bs{z}^{[1](m)}]$, and etc.

Using this carefully designed notation, the forward propagation can be naturally extended as

\begin{aligned} \bs{Z}^{[1]} &= \bs{W}^{[1]}\bs{X} + \bs{b}^{[1]} \\ \bs{A}^{[1]} &= g_1(\bs{Z}^{[1]}) \\ \bs{Z}^{[2]} &= \bs{W}^{[2]} \bs{A}^{[1]} + \bs{b}^{[2]} \\ \bs{A}^{[2]} &= g_2(\bs{Z}^{[2]}) \end{aligned}

where adding a column vector to a matrix is defined by broadcasting (adding the same vector to each column of the matrix).

## Backward propagation¶

The computation graph of the neural network can be expressed by the following figure

Let's do a quick dimension check. We have $\bs{X}\in \mathbb{R}^{3\times m}$, $\bs{W}^{[1]}\in \mathbb{R}^{4\times 3}$, both $\bs{Z}^{[1]}$ and $\bs{A}^{[1]}$ are in $\mathbb{R}^{4\times m}$, and $\bs{b}^{[1]} \in \mathbb{R}^{4\times 1}$. In the output layer, we have $\bs{W}^{[2]} \in \mathbb{R}^{1\times 4}$, both $\bs{Z}^{[2]}$ and $\bs{A}^{[2]}$ are in $\mathbb{R}^{1\times m}$, and $\bs{b}^{[2]}$ is a scaler (to be broadcasted). Finally, the predicted results $\widehat{\bs{Y}}$ and the actual values $\bs{Y}$ are both row vectors in $\mathbb{R}^{1\times m}$, and the final loss function is defined similarly as the previous post to be the negative log-likelihood:

$$L(\bs{A}^{[2]}, \bs{Y}) =\frac{1}{m}\left[ -\log (\bs{A}^{[2]})\bs{Y}\tr - \log(1-\bs{A}^{[2]})(1-\bs{Y})\tr\right]$$

Here we justify operations between matrices of different dimensions using broadcasting. For instance, in $(1-\bs{Y})$, the scalar $1$ is broadcasted to be a vector of 1's in the same dimension as $\bs{Y}$. Any function applied to a vector or a matrix is defined by applying the function to each element. (This will better facilitate its implementation in numpy.)

Let's introduce one final piece of notation. We are going to define $d\bs{X}$ to be the derivative of the loss function with respect to $\bs{X}$. That is, we let $d\bs{X} = dL/d\bs{X}$. Note that $d\bs{X}$ has the same dimension as $\bs{X}$.

To do back propagation, we first compute $d\bs{A}^{[2]}$, which is the derivative of the loss function with respect to the row vector $\bs{A}^{[2]}$:

\begin{aligned} d\bs{A}^{[2]} &= \frac{d}{d\bs{A}^{[2]}}L(\bs{A}^{[2]}, \bs{Y}) \\ &= \frac{1}{m}\left[-\frac{\bs{Y}}{\bs{A}^{[2]}} +\frac{1-\bs{Y}}{1-\bs{A}^{[2]} } \right] \end{aligned}

where division of two vectors of the same dimension is defined to be the elementwise division.

Next we calculated $d\bs{Z}^{[2]}$ using the chain rule:

\begin{aligned} d\bs{Z}^{[2]} &= \frac{d}{d\bs{Z}^{[2]}} L(\bs{A}^{[2]}, \bs{Y}) \\ &= d\bs{A}^{[2]} * \frac{d\bs{A}^{[2]}}{d\bs{Z}^{[2]}} \\ &= d\bs{A}^{[2]} * \frac{d \sigma(\bs{Z}^{[2]})}{d\bs{Z}^{[2]}} \end{aligned} \tag{1}

where the asterisk * denote elementwise multiplication of matrices, and if $\bs{x}$ and $\bs{y}$ have the same dimension, $\frac{d\bs{x}}{d\bs{y}}$ means elementwise differentiation. (On the otherhand, if I write $\frac{d}{d\bs{x}}\bs{y}$, then this denotes the conventional derivative.) These conventions are especially useful for deep learning when doing matrix calculus.

So, what is $\frac{d\sigma(\bs{Z}^{[2]})}{d\bs{Z}^{[2]}}$? It helps to consider the single variable case.

\begin{aligned} \frac{d}{dz} \sigma(z) &= \frac{d}{dz} \frac{1}{1-e^{-z}} \\ &= \frac{-e^{-z}}{(1-e^{-z})^2} \\ &= \frac{1}{1-e^{-z}} - \frac{1}{(1-e^{-z})^2} \\ &= \sigma(z)(1-\sigma(z)) \end{aligned}

This is a handy property of the sigmoid function. Let's plug it in to (1):

\begin{aligned} d\bs{Z}^{[2]} &= d\bs{A}^{[2]} * \bs{A}^{[2]}*(1-\bs{A}^{[2]}) \\ &= \frac{1}{m}\left[-\frac{\bs{Y}}{\bs{A}^{[2]}} +\frac{1-\bs{Y}}{1-\bs{A}^{[2]} } \right] *\bs{A}^{[2]} * (1-\bs{A}^{[2]}) \\ &= \frac{1}{m} \left[-\bs{Y}*(1-\bs{A}^{[2]}) + (1-\bs{Y}) * \bs{A}^{[2]} \right] \\ &= \frac{1}{m} \left[ \bs{A}^{[2]} - \bs{Y} \right] \end{aligned} \tag{2}

This is the first step of the back propagation. Let's compute the derivatives of the parameters $d\bs{W}^{[2]}$ and $d\bs{b}^{[2]}$. Recall that

$$\bs{Z}^{[2]} = \bs{W}^{[2]}\bs{A}^{[1]} + \bs{b}^{[2]}$$

Note that $\bs{b}^{[2]}$ is really just a scalar, and the addition is enabled by broadcasting. So we can rewrite this as

$$\bs{Z}^{[2]} = \bs{W}^{[2]}\bs{A}^{[1]} + \bs{b}^{[2]} \bs{1}\tr$$

where $\bs{1} = [1, ..., 1]\tr$ is a column vector consisting of $m$ 1's.

A trick for taking derivative of such a complicated system is to always keep track of the dimensions. We wish to find $d\bs{W}^{[2]}$, which is row vector in $\mathbb{R}^{1\times 4}$. And the derivative from the previous step $d\bs{Z}^{[2]}$ is in $\mathbb{R}^{1\times m}$. Hence we need to multiply by a matrix of dimension $(m\times 4)$. This is exactly ${\bs{A}^{[1]}}\tr$. Hence,

$$d\bs{W}^{[2]} = d\bs{Z}^{[2]}{\bs{A}^{[1]}}\tr. \tag{3}$$

Likewise, since $\bs{b}^{[2]}$ is a scaler, we find that

$$d\bs{b}^{[2]} = d\bs{Z}^{[2]}\bs{1} = \sum_{i=1}^m \left[d\bs{Z}^{[2]}\right]_i \tag{4}$$

Now let's move on to the hidden layer! We need to find $d\bs{A}^{[1]}$ which is a matrix in $\mathbb{R}^{4\times m}$. However, the dimensions become tricky:

$$\bs{Z}^{[2]}_{(1\times m)} = \bs{W}^{[2]}_{(1\times 4)}\bs{A}^{[1]}_{(4\times m)} + \bs{b}^{[2]}\bs{1}\tr_{(1\times m)}$$

After staring at this equation long enough, a sudden insight sprung out:

$$d\bs{A}^{[1]} = {\bs{W}^{[2]}}\tr d\bs{Z}^{[2]}. \tag{5}$$

Indeed the dimensions match since ${\bs{W}^{[2]}}\tr$ is $(4\times 1)$ and $d\bs{Z}^{[2]}$ is $(1\times m)$, so the result is $(4\times m)$! Matrix calculus is hard, but with some intuitive thinking, it can be easy.

Now, the hard part is over, and the next step is straightforward. We find that

\begin{aligned} d\bs{Z}^{[1]} &= \frac{dg_1(\bs{Z}^{[1]})}{d\bs{Z}^{[1]}} \\ &=d\bs{A}^{[1]} * g'_1(\bs{Z}^{[1]}) \end{aligned} \tag{6}

where $g_1$ is the activation function for the hidden layer. For example, it can be the ReLU function.

Finally we find the update for the weight parameter:

$$d\bs{W}^{[1]}_{(4\times 3)} = d\bs{Z}^{[1]}_{(4\times m)} \bs{X}\tr_{(m\times 3)} \tag{7}$$

And the lastly, the update for $\bs{b}^{[1]}$ is

$$d\bs{b}^{[1]} = d\bs{Z}^{[1]}_{(4\times m)} \bs{1}\tr_{(m\times 1)} = \sum_{i=1}^m \left[d\bs{Z}^{[1]}\right]_i$$

To summarize, the gradient descent for this two-layered neural network is done by the following algorithm.

### Forward propagation:¶

\begin{aligned} \bs{Z}^{[1]} &= \bs{W}^{[1]}\bs{X} + \bs{b}^{[1]} \\ \bs{A}^{[1]} &= g_1(\bs{Z}^{[1]}) \\ \bs{Z}^{[2]} &= \bs{W}^{[2]} \bs{A}^{[1]} + \bs{b}^{[2]} \\ \bs{A}^{[2]} &= g_2(\bs{Z}^{[2]}) \end{aligned}

### Backward propagation:¶

\begin{aligned} d\bs{Z}^{[2]} &= \frac{1}{m} \left[ \bs{A}^{[2]} - \bs{Y} \right] \\ d\bs{W}^{[2]} &= d\bs{Z}^{[2]} {\bs{A}^{[1]}}\tr \\ d\bs{b}^{[2]} &= \sum_{i=1}^m \left[d\bs{Z}^{[2]}\right]_i \\ d\bs{Z}^{[1]} &= {\bs{W}^{[2]}}\tr d\bs{Z}^{[2]} * g'_1(\bs{Z}^{[1]}) \\ d\bs{W}^{[1]} &= d\bs{Z}^{[1]} \bs{X}\tr \\ d\bs{b}^{[1]} &= \sum_{i=1}^m \left[d\bs{Z}^{[1]}\right]_i \end{aligned}