*Modified*: July 6, 2018
$\newcommand{\bs}{\boldsymbol}$
$\newcommand{\argmin}[1]{\underset{\bs{#1}}{\text{arg min}}\,}$
$\newcommand{\argmax}[1]{\underset{\bs{#1}}{\text{arg max}}\,}$
$\newcommand{\tr}{^{\top}}$
$\newcommand{\norm}[1]{\left|\left|\,#1\,\right|\right|}$
$\newcommand{\given}{\,|\,}$

And for the first hidden layer, we define the activation function to be $g_1$ and in the output layer, the activation function $g_2$ is the sigmoid function. Now let's derive the gradient descent algorithm for some fun!

First let $\bs{x}$ denote the column vector of the input layer. In the picture above, $\bs{x} = [x_1, x_2, x_3]\tr$. Let $g_1$ be the activation function for the hidden layer, and let $g_2$ be the activation function for the output layer.

Let $n^{[0]}, n^{[1]}, n^{[2]}$ be the number of nodes in the input layer, hidden layer, and output layer, respectively. Here, $n^{[0]}=3, n^{[1]}=4$ and $n^{[2]}=1$.

For layers $i=1$ and $i=2$, and each node $j=1,...,n^{[i]}$, let $\bs{w}^{[i]}_j \in \mathbb{R}^{n^{[i-1]}\times 1}$ be the weight vectors and $b_j^{[i]}$ the bias parameters.

Let $z_i^{[1]} = {\bs{w}_i^{[1]}}\tr \bs{x} + b_i^{[1]}$ for $1\leq i \leq n^{[1]}$ in the hidden layer. Likewise, we define $z_i^{[2]} = {\bs{w}_i^{[2]}}\tr \bs{x} + b_i^{[2]}$ for $1\leq i \leq n^{[2]}$ in the output layer. Let $a_i^{[1]} = g_i(z_i^{[1]})$ for $1\leq i \leq n^{[1]}$ and $a_i^{[2]} = g_i(z_i^{[2]})$ for $1\leq i \leq n^{[2]}$.

This is a lot of parameters. We would like to vectorize them using matrices. So let's define

$$ \bs{W}^{[1]} = \begin{bmatrix} \cdot & {\bs{w}_1^{[1]}}\tr & \cdot \\ \cdot & {\bs{w}_2^{[1]}}\tr & \cdot \\ \cdot & {\bs{w}_3^{[1]}}\tr & \cdot \\ \cdot & {\bs{w}_4^{[1]}}\tr & \cdot \end{bmatrix}, \,\, \bs{b}^{[1]} = \begin{bmatrix} b_1^{[1]} \\ b_2^{[1]} \\ b_3^{[1]} \\ b_4^{[1]} \end{bmatrix}, \,\, \bs{z}^{[1]} = \begin{bmatrix} z_1^{[1]} \\ z_2^{[1]} \\ z_3^{[1]} \\ z_4^{[1]} \end{bmatrix}, \,\, \bs{a}^{[1]} = \begin{bmatrix} a_1^{[1]} \\ a_2^{[1]} \\ a_3^{[1]} \\ a_4^{[1]} \end{bmatrix}. $$

and similarly for $\bs{W}^{[2]}, \bs{b}^{[2]}, \bs{z}^{[2]}$, and $\bs{a}^{[2]}$.

For a single training example $(\bs{x}, y)$, the forward propagation step can be succinctly expressed as

$$ \begin{aligned} \bs{z}^{[1]} &= \bs{W}^{[1]}\bs{x} + \bs{b}^{[1]} \\ \bs{a}^{[1]} &= g_1(\bs{z}^{[1]}) \\ \bs{z}^{[2]} &= \bs{W}^{[2]} \bs{a}^{[1]} + \bs{b}^{[2]} \\ \bs{a}^{[2]} &= g_2(\bs{z}^{[2]}) \end{aligned} $$

So far, the procedure is only for a single training example. For $m$ training examples $(\bs{x}^{(1)}, y^{(1)}), ..., (\bs{x}^{(m)}, y^{(m)})$, we introduce the variables $\bs{X}, \bs{Y},\bs{Z}^{[1]}, \bs{Z}^{[2]}, \bs{A}^{[1]}, \bs{A}^{[2]}$ as the extension of their lower-case counterparts. The method of extension is to stack the lower-case vectors as columns for each training example. For instance $\bs{Y} = [y_1,..., y_m]$, $\bs{X} = [\bs{x}^{(1)},..., \bs{x}^{(m)}]$, $\bs{Z}^{[1]} = [\bs{z}^{[1](1)}, ..., \bs{z}^{[1](m)}]$, and etc.

Using this carefully designed notation, the forward propagation can be naturally extended as

$$ \begin{aligned} \bs{Z}^{[1]} &= \bs{W}^{[1]}\bs{X} + \bs{b}^{[1]} \\ \bs{A}^{[1]} &= g_1(\bs{Z}^{[1]}) \\ \bs{Z}^{[2]} &= \bs{W}^{[2]} \bs{A}^{[1]} + \bs{b}^{[2]} \\ \bs{A}^{[2]} &= g_2(\bs{Z}^{[2]}) \end{aligned} $$

where adding a column vector to a matrix is defined by `broadcasting`

(adding the same vector to each column of the matrix).

The computation graph of the neural network can be expressed by the following figure

Let's do a quick dimension check. We have $\bs{X}\in \mathbb{R}^{3\times m}$, $\bs{W}^{[1]}\in \mathbb{R}^{4\times 3}$, both $\bs{Z}^{[1]}$ and $\bs{A}^{[1]}$ are in $\mathbb{R}^{4\times m}$, and $\bs{b}^{[1]} \in \mathbb{R}^{4\times 1}$. In the output layer, we have $\bs{W}^{[2]} \in \mathbb{R}^{1\times 4}$, both $\bs{Z}^{[2]}$ and $\bs{A}^{[2]}$ are in $\mathbb{R}^{1\times m}$, and $\bs{b}^{[2]}$ is a scaler (to be broadcasted). Finally, the predicted results $\widehat{\bs{Y}}$ and the actual values $\bs{Y}$ are both row vectors in $\mathbb{R}^{1\times m}$, and the final loss function is defined similarly as the previous post to be the `negative log-likelihood`

:

$$ L(\bs{A}^{[2]}, \bs{Y}) =\frac{1}{m}\left[ -\log (\bs{A}^{[2]})\bs{Y}\tr - \log(1-\bs{A}^{[2]})(1-\bs{Y})\tr\right] $$

Here we justify operations between matrices of different dimensions using `broadcasting`

. For instance, in $(1-\bs{Y})$, the scalar $1$ is broadcasted to be a vector of 1's in the same dimension as $\bs{Y}$. Any function applied to a vector or a matrix is defined by applying the function to each element. (This will better facilitate its implementation in `numpy`

.)

Let's introduce one final piece of notation. We are going to define $d\bs{X}$ to be the derivative of the loss function with respect to $\bs{X}$. That is, we let $d\bs{X} = dL/d\bs{X}$. Note that $d\bs{X}$ has the same dimension as $\bs{X}$.

To do back propagation, we first compute $d\bs{A}^{[2]}$, which is the derivative of the loss function with respect to the row vector $\bs{A}^{[2]}$:

$$ \begin{aligned} d\bs{A}^{[2]} &= \frac{d}{d\bs{A}^{[2]}}L(\bs{A}^{[2]}, \bs{Y}) \\ &= \frac{1}{m}\left[-\frac{\bs{Y}}{\bs{A}^{[2]}} +\frac{1-\bs{Y}}{1-\bs{A}^{[2]} } \right] \end{aligned} $$

where division of two vectors of the same dimension is defined to be the elementwise division.

Next we calculated $d\bs{Z}^{[2]}$ using the chain rule:

$$ \begin{aligned} d\bs{Z}^{[2]} &= \frac{d}{d\bs{Z}^{[2]}} L(\bs{A}^{[2]}, \bs{Y}) \\ &= d\bs{A}^{[2]} * \frac{d\bs{A}^{[2]}}{d\bs{Z}^{[2]}} \\ &= d\bs{A}^{[2]} * \frac{d \sigma(\bs{Z}^{[2]})}{d\bs{Z}^{[2]}} \end{aligned} \tag{1} $$

where the asterisk * denote elementwise multiplication of matrices, and if $\bs{x}$ and $\bs{y}$ have the same dimension, $\frac{d\bs{x}}{d\bs{y}}$ means `elementwise differentiation`

. (On the otherhand, if I write $\frac{d}{d\bs{x}}\bs{y}$, then this denotes the conventional derivative.) These conventions are especially useful for deep learning when doing matrix calculus.

So, what is $\frac{d\sigma(\bs{Z}^{[2]})}{d\bs{Z}^{[2]}}$? It helps to consider the single variable case.

$$ \begin{aligned} \frac{d}{dz} \sigma(z) &= \frac{d}{dz} \frac{1}{1-e^{-z}} \\ &= \frac{-e^{-z}}{(1-e^{-z})^2} \\ &= \frac{1}{1-e^{-z}} - \frac{1}{(1-e^{-z})^2} \\ &= \sigma(z)(1-\sigma(z)) \end{aligned} $$

This is a handy property of the sigmoid function. Let's plug it in to (1):

$$ \begin{aligned} d\bs{Z}^{[2]} &= d\bs{A}^{[2]} * \bs{A}^{[2]}*(1-\bs{A}^{[2]}) \\ &= \frac{1}{m}\left[-\frac{\bs{Y}}{\bs{A}^{[2]}} +\frac{1-\bs{Y}}{1-\bs{A}^{[2]} } \right] *\bs{A}^{[2]} * (1-\bs{A}^{[2]}) \\ &= \frac{1}{m} \left[-\bs{Y}*(1-\bs{A}^{[2]}) + (1-\bs{Y}) * \bs{A}^{[2]} \right] \\ &= \frac{1}{m} \left[ \bs{A}^{[2]} - \bs{Y} \right] \end{aligned} \tag{2} $$

This is the first step of the back propagation. Let's compute the derivatives of the parameters $d\bs{W}^{[2]}$ and $d\bs{b}^{[2]}$. Recall that

$$ \bs{Z}^{[2]} = \bs{W}^{[2]}\bs{A}^{[1]} + \bs{b}^{[2]} $$

Note that $\bs{b}^{[2]}$ is really just a scalar, and the addition is enabled by `broadcasting`

. So we can rewrite this as

$$ \bs{Z}^{[2]} = \bs{W}^{[2]}\bs{A}^{[1]} + \bs{b}^{[2]} \bs{1}\tr $$

where $\bs{1} = [1, ..., 1]\tr$ is a column vector consisting of $m$ 1's.

A trick for taking derivative of such a complicated system is to always keep track of the dimensions. We wish to find $d\bs{W}^{[2]}$, which is row vector in $\mathbb{R}^{1\times 4}$. And the derivative from the previous step $d\bs{Z}^{[2]}$ is in $\mathbb{R}^{1\times m}$. Hence we need to multiply by a matrix of dimension $(m\times 4)$. This is exactly ${\bs{A}^{[1]}}\tr$. Hence,

$$ d\bs{W}^{[2]} = d\bs{Z}^{[2]}{\bs{A}^{[1]}}\tr. \tag{3} $$

Likewise, since $\bs{b}^{[2]}$ is a scaler, we find that

$$ d\bs{b}^{[2]} = d\bs{Z}^{[2]}\bs{1} = \sum_{i=1}^m \left[d\bs{Z}^{[2]}\right]_i \tag{4} $$

Now let's move on to the hidden layer! We need to find $d\bs{A}^{[1]}$ which is a matrix in $\mathbb{R}^{4\times m}$. However, the dimensions become tricky:

$$ \bs{Z}^{[2]}_{(1\times m)} = \bs{W}^{[2]}_{(1\times 4)}\bs{A}^{[1]}_{(4\times m)} + \bs{b}^{[2]}\bs{1}\tr_{(1\times m)} $$

After staring at this equation long enough, a sudden insight sprung out:

$$ d\bs{A}^{[1]} = {\bs{W}^{[2]}}\tr d\bs{Z}^{[2]}. \tag{5} $$

Indeed the dimensions match since ${\bs{W}^{[2]}}\tr$ is $(4\times 1)$ and $d\bs{Z}^{[2]}$ is $(1\times m)$, so the result is $(4\times m)$! Matrix calculus is hard, but with some intuitive thinking, it can be easy.

Now, the hard part is over, and the next step is straightforward. We find that

$$ \begin{aligned} d\bs{Z}^{[1]} &= \frac{dg_1(\bs{Z}^{[1]})}{d\bs{Z}^{[1]}} \\ &=d\bs{A}^{[1]} * g'_1(\bs{Z}^{[1]}) \end{aligned} \tag{6} $$

where $g_1$ is the activation function for the hidden layer. For example, it can be the `ReLU`

function.

Finally we find the update for the weight parameter:

$$ d\bs{W}^{[1]}_{(4\times 3)} = d\bs{Z}^{[1]}_{(4\times m)} \bs{X}\tr_{(m\times 3)} \tag{7} $$

And the lastly, the update for $\bs{b}^{[1]}$ is

$$ d\bs{b}^{[1]} = d\bs{Z}^{[1]}_{(4\times m)} \bs{1}\tr_{(m\times 1)} = \sum_{i=1}^m \left[d\bs{Z}^{[1]}\right]_i $$

To summarize, the gradient descent for this two-layered neural network is done by the following algorithm.

$$ \begin{aligned} \bs{Z}^{[1]} &= \bs{W}^{[1]}\bs{X} + \bs{b}^{[1]} \\ \bs{A}^{[1]} &= g_1(\bs{Z}^{[1]}) \\ \bs{Z}^{[2]} &= \bs{W}^{[2]} \bs{A}^{[1]} + \bs{b}^{[2]} \\ \bs{A}^{[2]} &= g_2(\bs{Z}^{[2]}) \end{aligned} $$

$$ \begin{aligned} d\bs{Z}^{[2]} &= \frac{1}{m} \left[ \bs{A}^{[2]} - \bs{Y} \right] \\ d\bs{W}^{[2]} &= d\bs{Z}^{[2]} {\bs{A}^{[1]}}\tr \\ d\bs{b}^{[2]} &= \sum_{i=1}^m \left[d\bs{Z}^{[2]}\right]_i \\ d\bs{Z}^{[1]} &= {\bs{W}^{[2]}}\tr d\bs{Z}^{[2]} * g'_1(\bs{Z}^{[1]}) \\ d\bs{W}^{[1]} &= d\bs{Z}^{[1]} \bs{X}\tr \\ d\bs{b}^{[1]} &= \sum_{i=1}^m \left[d\bs{Z}^{[1]}\right]_i \end{aligned} $$