Artificial Neural Networks

7/22/2020 $\newcommand{\bs}{\boldsymbol}$ $\newcommand{\argmin}[1]{\underset{\bs{#1}}{\text{arg min}}\,}$ $\newcommand{\argmax}[1]{\underset{\bs{#1}}{\text{arg max}}\,}$ $\newcommand{\tr}{^{\top}}$ $\newcommand{\norm}[1]{\left|\left|\,#1\,\right|\right|}$ $\newcommand{\given}{\,|\,}$

In this post, we derive the gradient descent algorithm for artificial neural networks. Let's first consider binary classification. Suppose we have a simple 2-layer neural network shown below.

For the hidden layer, we define the activation function to be $g_1$, which is typically the ReLU function, and the final activation function $g_2$ is the sigmoid function.

Forward propagation

Let $x$ denote the column vector of the input layer. In the picture above, $x = [x_1, x_2, x_3]\tr$. Let $g_1$ be the activation function for the hidden layer, and let $g_2$ be the activation function for the output layer.

Let $n^{[0]}, n^{[1]}, n^{[2]}$ be the number of nodes in the input layer, hidden layer, and output layer, respectively. Here, $n^{[0]}=3, n^{[1]}=4$ and $n^{[2]}=1$.

For layers $i=1$ and $i=2$, and each node $j=1,...,n^{[i]}$, let $w^{[i]}_j \in \mathbb{R}^{n^{[i-1]}\times 1}$ be the weight vectors and $b_j^{[i]}$ the bias parameters.

Let $z_i^{[1]} = {w_i^{[1]}}\tr \bs{x} + b_i^{[1]}$ and $a_i^{[1]} = g_1(z_i^{[1]})$ for $1\leq i \leq n^{[1]}$ in the hidden layer. Likewise, letting $a^{[1]} = [a_1^{[1]}, ..., a_{n^{[1]}}^{[1]}]\tr$ be the column vector of hidden-layer activations, we define $z_i^{[2]} = {w_i^{[2]}}\tr a^{[1]} + b_i^{[2]}$ and $a_i^{[2]} = g_2(z_i^{[2]})$ for $1\leq i \leq n^{[2]}$ in the output layer.

To reduce the number of written parameters, we use the following matrix notation.

$$ W^{[1]} = \begin{bmatrix} {w_1^{[1]}}\tr \\ {w_2^{[1]}}\tr \\ {w_3^{[1]}}\tr \\ {w_4^{[1]}}\tr \end{bmatrix}, \,\, b^{[1]} = \begin{bmatrix} b_1^{[1]} \\ b_2^{[1]} \\ b_3^{[1]} \\ b_4^{[1]} \end{bmatrix}, \,\, z^{[1]} = \begin{bmatrix} z_1^{[1]} \\ z_2^{[1]} \\ z_3^{[1]} \\ z_4^{[1]} \end{bmatrix}, \,\, a^{[1]} = \begin{bmatrix} a_1^{[1]} \\ a_2^{[1]} \\ a_3^{[1]} \\ a_4^{[1]} \end{bmatrix}. $$

and similarly for $W^{[2]}, b^{[2]}, z^{[2]}$, and $a^{[2]}$.

For a single training example $(x, y)$, the forward propagation step can be succinctly expressed as

$$ \begin{aligned} z^{[1]} &= W^{[1]}x + b^{[1]} \\ a^{[1]} &= g_1(z^{[1]}) \\ z^{[2]} &= W^{[2]} a^{[1]} + b^{[2]} \\ a^{[2]} &= g_2(z^{[2]}) \end{aligned} $$

So far, the procedure is only for a single training example. For $m$ training examples $(x^{(1)}, y^{(1)}), ..., (x^{(m)}, y^{(m)})$, we introduce the variables $X, Y, Z^{[1]}, Z^{[2]}, A^{[1]}, A^{[2]}$ as extensions of their lower-case counterparts, obtained by stacking the lower-case vectors as columns, one column per training example. For instance, $X = [x^{(1)},..., x^{(m)}]$, $Y = [y^{(1)},..., y^{(m)}]$, $Z^{[1]} = [z^{[1](1)}, ..., z^{[1](m)}]$, and so on.

Using this carefully designed notation, the forward propagation can be naturally extended as

$$ \begin{aligned} Z^{[1]} &= W^{[1]}X + b^{[1]} \\ A^{[1]} &= g_1(Z^{[1]}) \\ Z^{[2]} &= W^{[2]} A^{[1]} + b^{[2]} \\ A^{[2]} &= g_2(Z^{[2]}) \end{aligned} $$

where adding a column vector to a matrix is defined by broadcasting (adding the same vector to each column of the matrix).
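To make the notation concrete, here is a minimal NumPy sketch of this vectorized forward pass. The function names (`forward`, `relu`, `sigmoid`) are my own choices for illustration; for the network in the figure, $W^{[1]}$ is $4\times 3$, $b^{[1]}$ is $4\times 1$, $W^{[2]}$ is $1\times 4$, and $b^{[2]}$ is $1\times 1$.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(z, 0.0)

def forward(X, W1, b1, W2, b2):
    """Vectorized forward pass; X has shape (n0, m), one column per training example."""
    Z1 = W1 @ X + b1      # (n1, m); b1 of shape (n1, 1) is broadcast across the columns
    A1 = relu(Z1)         # g1 = ReLU
    Z2 = W2 @ A1 + b2     # (n2, m)
    A2 = sigmoid(Z2)      # g2 = sigmoid
    return Z1, A1, Z2, A2
```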

Backward propagation

The computation graph of the neural network can be expressed by the following figure

Let's do a quick dimension check. We have $X\in \mathbb{R}^{3\times m}$, $W^{[1]}\in \mathbb{R}^{4\times 3}$, both $Z^{[1]}$ and $A^{[1]}$ are in $\mathbb{R}^{4\times m}$, and $b^{[1]} \in \mathbb{R}^{4\times 1}$. In the output layer, we have $W^{[2]} \in \mathbb{R}^{1\times 4}$, both $Z^{[2]}$ and $A^{[2]}$ are in $\mathbb{R}^{1\times m}$, and $b^{[2]}$ is a scalar (to be broadcast). Finally, the predicted values $\widehat{Y} = A^{[2]}$ and the actual values $Y$ are both row vectors in $\mathbb{R}^{1\times m}$, and we use the cross-entropy loss function:

$$ L(A^{[2]}, Y) =\frac{1}{m}\left[ -\log (A^{[2]})Y\tr - \log(1-A^{[2]})(1-Y)\tr\right] \tag{1} $$

Here we justify operations between matrices of different dimensions using broadcasting. For instance, in $(1-Y)$, the scalar $1$ is broadcast to a vector of 1's with the same dimensions as $Y$. Any function applied to a vector or a matrix is applied elementwise. (These conventions also make the numpy implementation straightforward.)
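As a sketch of how (1) might be computed under these conventions (the function name is mine, and the small `eps` clip is an extra safeguard against $\log(0)$ that is not part of the formula):

```python
import numpy as np

def binary_cross_entropy(A2, Y, eps=1e-12):
    """A2 and Y are row arrays of shape (1, m); returns the scalar loss in (1)."""
    m = Y.shape[1]
    A2 = np.clip(A2, eps, 1.0 - eps)   # guard against log(0); not part of formula (1)
    loss = -(np.log(A2) @ Y.T + np.log(1.0 - A2) @ (1.0 - Y).T) / m
    return loss.item()                 # loss is a (1, 1) array holding the scalar
```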

Let's introduce one final piece of notation. We are going to define $dX$ to be the derivative of the loss function with respect to $X$. That is, we let $dX = dL/dX$. Note that $dX$ has the same dimension as $X$.

To do back propagation, we first compute $dA^{[2]}$, which is the derivative of the loss function with respect to the row vector $A^{[2]}$:

$$ \begin{aligned} dA^{[2]} &= \frac{d}{dA^{[2]}}L(A^{[2]}, Y) \\ &= \frac{1}{m}\left[-\frac{Y}{A^{[2]}} +\frac{1-Y}{1-A^{[2]} } \right] \end{aligned} $$

where division of two vectors of the same dimension is defined to be the elementwise division.

Next we calculate $dZ^{[2]}$ using the chain rule:

$$ \begin{aligned} dZ^{[2]} &= \frac{d}{dZ^{[2]}} L(A^{[2]}, Y) \\ &= dA^{[2]} * \frac{dA^{[2]}}{dZ^{[2]}} \\ &= dA^{[2]} * \frac{d \sigma(Z^{[2]})}{dZ^{[2]}} \end{aligned} $$

where the asterisk $*$ denotes elementwise multiplication of matrices, and if $x$ and $y$ have the same dimensions, $\frac{dx}{dy}$ means elementwise differentiation. (On the other hand, the notation $\frac{d}{dx}y$ denotes the conventional derivative.) These conventions are especially useful when doing matrix calculus for deep learning.

So, what is $\frac{d\sigma(Z^{[2]})}{dZ^{[2]}}$? It helps to consider the single variable case.

$$ \begin{aligned} \frac{d}{dz} \sigma(z) &= \frac{d}{dz} \frac{1}{1+e^{-z}} \\ &= \frac{e^{-z}}{(1+e^{-z})^2} \\ &= \frac{1}{1+e^{-z}}\cdot \frac{e^{-z}}{1+e^{-z}} \\ &= \sigma(z)(1-\sigma(z)) \end{aligned} $$
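As a quick numerical sanity check of this identity (my own check, not part of the derivation), we can compare a centered finite-difference estimate of $\sigma'(z)$ against $\sigma(z)(1-\sigma(z))$:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 101)
h = 1e-6
finite_diff = (sigmoid(z + h) - sigmoid(z - h)) / (2 * h)   # centered difference
analytic = sigmoid(z) * (1 - sigmoid(z))                    # sigma(z) * (1 - sigma(z))
print(np.max(np.abs(finite_diff - analytic)))               # should be tiny (~1e-10)
```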

This is a handy property of the sigmoid function. Plugging it into the chain-rule expression for $dZ^{[2]}$ above:

$$ \begin{aligned} dZ^{[2]} &= dA^{[2]} * A^{[2]}*(1-A^{[2]}) \\ &= \frac{1}{m}\left[-\frac{Y}{A^{[2]}} +\frac{1-Y}{1-A^{[2]} } \right] * A^{[2]} * (1- A^{[2]}) \\ &= \frac{1}{m} \left[-Y*(1-A^{[2]}) + (1-Y) * A^{[2]} \right] \\ &= \frac{1}{m} \left[ A^{[2]} - Y \right] \end{aligned} $$

This is the first step of the back propagation. Next, let's compute the derivatives with respect to the parameters, $dW^{[2]}$ and $db^{[2]}$. Recall that

$$ Z^{[2]} = W^{[2]}A^{[1]} + b^{[2]} $$

Note that $b^{[2]}$ is really just a scalar, and the addition is enabled by broadcasting. So we can rewrite this as

$$ Z^{[2]} = W^{[2]}A^{[1]} + b^{[2]} \bs{1}\tr $$

where $\bs{1} = [1, ..., 1]\tr$ is a column vector consisting of $m$ 1's.

A trick for taking derivatives in such a complicated system is to always keep track of the dimensions. We wish to find $dW^{[2]}$, which is a row vector in $\mathbb{R}^{1\times 4}$. The derivative from the previous step, $dZ^{[2]}$, is in $\mathbb{R}^{1\times m}$. Hence we need to multiply it by a matrix of dimension $(m\times 4)$, which is exactly ${A^{[1]}}\tr$. Hence,

$$ dW^{[2]} = dZ^{[2]}{A^{[1]}}\tr. $$

Likewise, since $b^{[2]}$ is a scalar, we find that

$$ db^{[2]} = dZ^{[2]}\bs{1} = \sum_{i=1}^m \left[dZ^{[2]}\right]_i. $$
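The dimension-tracking trick translates directly into shape checks in code. A small sketch with placeholder random arrays (the shapes match the running example and are my own choice for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
m = 5                                # a small batch, just for the shape check
A1 = rng.random((4, m))              # hidden activations, shape (n1, m) = (4, m)
dZ2 = rng.standard_normal((1, m))    # gradient from the previous step, shape (1, m)

dW2 = dZ2 @ A1.T                     # (1, m) @ (m, 4) -> (1, 4), same shape as W2
db2 = dZ2 @ np.ones((m, 1))          # (1, m) @ (m, 1) -> (1, 1); equals dZ2.sum(axis=1, keepdims=True)
assert dW2.shape == (1, 4) and db2.shape == (1, 1)
```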

Now let's move on to the hidden layer! We need to find $dA^{[1]}$ which is a matrix in $\mathbb{R}^{4\times m}$. However, the dimensions become tricky:

$$ Z^{[2]}_{(1\times m)} = W^{[2]}_{(1\times 4)}A^{[1]}_{(4\times m)} + b^{[2]}\bs{1}\tr_{(1\times m)} $$

The derivative $dA^{[1]}$ can be obtained by carefully keeping track of the dimensions. The following equation holds since ${W^{[2]}}\tr$ is $(4\times 1)$, $dZ^{[2]}$ is $(1\times m)$, and their product is $(4\times m)$:

$$ dA^{[1]} = {W^{[2]}}\tr dZ^{[2]}. $$

Suppose $g_1$ is the ReLU function. Its derivative can be easily computed as

$$ g_1(x) = \begin{cases} x & \text{ if } x > 0 \\ 0 & \text{ otherwise.} \end{cases} \implies g'_1(x) = I(x > 0), $$

where $I(\cdot)$ is the indicator function. Hence the derivative with respect to the first layer can be computed as

$$ \begin{aligned} dZ^{[1]} &=dA^{[1]} * g'_1(Z^{[1]}) \\ &=dA^{[1]} * I(Z^{[1]} > 0). \end{aligned} $$

Finally we find the update for the weight parameter:

$$ dW^{[1]}_{(4\times 3)} = dZ^{[1]}_{(4\times m)} X\tr_{(m\times 3)}. $$

Lastly, the update for $b^{[1]}$ is

$$ db^{[1]} = dZ^{[1]}_{(4\times m)} \bs{1}_{(m\times 1)} = \sum_{i=1}^m \left[dZ^{[1]}\right]_i, $$

where $\left[dZ^{[1]}\right]_i$ denotes the $i$th column of $dZ^{[1]}$.

To summarize, gradient descent for this two-layer neural network for binary classification with sigmoid activation uses the following algorithm.

Forward propagation:

$$ \begin{aligned} Z^{[1]} &= W^{[1]}X + b^{[1]} \\ A^{[1]} &= g_1(Z^{[1]}) \\ Z^{[2]} &= W^{[2]} A^{[1]} + b^{[2]} \\ A^{[2]} &= g_2(Z^{[2]}) \end{aligned} \tag{2} $$

Backward propagation:

$$ \begin{aligned} dA^{[2]} &= \frac{1}{m}\left[-\frac{Y}{A^{[2]}} +\frac{1-Y}{1-A^{[2]} } \right] \\ dZ^{[2]} &= dA^{[2]} * A^{[2]}*(1-A^{[2]}) \\ dW^{[2]} &= dZ^{[2]} {A^{[1]}}\tr \\ db^{[2]} &= \sum_{i=1}^m \left[dZ^{[2]}\right]_i \\ dA^{[1]} &= {W^{[2]}}\tr dZ^{[2]} \\ dZ^{[1]} &= dA^{[1]} * I(Z^{[1]}>0) \\ dW^{[1]} &= dZ^{[1]} X\tr \\ db^{[1]} &= \sum_{i=1}^m \left[dZ^{[1]}\right]_i \end{aligned} \tag{3} $$
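Putting (2) and (3) together, one full gradient-descent step might look like the following NumPy sketch. The function name `gradient_step`, the learning rate `lr`, and the in-place parameter update are my own choices for illustration, not prescribed by the derivation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(X, Y, W1, b1, W2, b2, lr=0.1):
    """One step of batch gradient descent for the 2-layer network; X is (n0, m), Y is (1, m)."""
    m = X.shape[1]

    # Forward propagation, equation (2)
    Z1 = W1 @ X + b1
    A1 = np.maximum(Z1, 0.0)              # ReLU
    Z2 = W2 @ A1 + b2
    A2 = sigmoid(Z2)

    # Backward propagation, equation (3)
    dZ2 = (A2 - Y) / m                    # simplified form of dA2 * A2 * (1 - A2)
    dW2 = dZ2 @ A1.T
    db2 = dZ2.sum(axis=1, keepdims=True)
    dA1 = W2.T @ dZ2
    dZ1 = dA1 * (Z1 > 0)                  # elementwise product with the indicator I(Z1 > 0)
    dW1 = dZ1 @ X.T
    db1 = dZ1.sum(axis=1, keepdims=True)

    # Parameter update
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2
    return W1, b1, W2, b2
```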

Sigmoid Activation for Multiclass Classification

Previously we assumed that the final layer has only one activation node. A natural extension of (2) and (3) accommodates multiclass classification using the sigmoid activation at each of the output nodes. Suppose the final layer has multiple nodes, each of which is passed through the sigmoid function to produce a number between 0 and 1. (Note that they need not add up to 1.) We only need to slightly modify the loss function to be

$$ L(A^{[2]}, Y) =\frac{1}{m}\sum_{i,j} \left[ -\log (A^{[2]})*Y - \log(1-A^{[2]}) *(1-Y)\right], \tag{4} $$

where the notation $\sum_{i,j}$ denotes summing over all components of a matrix. It follows that (3) holds verbatim for multiclass classification using sigmoid activation; only the dimensions of the output-layer quantities change.
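With $A^{[2]}$ and $Y$ now of shape $(n^{[2]}\times m)$, a sketch of the loss (4) and its gradient (the function name and the `eps` clip, added only to avoid $\log(0)$, are my own):

```python
import numpy as np

def multilabel_cross_entropy(A2, Y, eps=1e-12):
    """A2 and Y have shape (n2, m); returns the scalar loss (4) and the gradient dA2."""
    m = Y.shape[1]
    A2 = np.clip(A2, eps, 1.0 - eps)                 # guard against log(0)
    loss = -np.sum(np.log(A2) * Y + np.log(1.0 - A2) * (1.0 - Y)) / m
    dA2 = (-Y / A2 + (1.0 - Y) / (1.0 - A2)) / m     # entrywise, same formula as before
    return loss, dA2
```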

Softmax Activation

The disadvantage of using sigmoid activation for multiclass classification is that the predicted probabilities do not add up to 1. The softmax function is a vector-valued function, $\text{softmax}: \mathbb{R}^{K\times m} \to \mathbb{R}^{K\times m}$, defined by

$$ A^{[2]} = \text{softmax}(Z^{[2]}) = \frac{e^{Z^{[2]}}}{\sum_{k=1}^K e^{Z^{[2]}_k}}, \quad Z^{[2]}_k \in \mathbb{R}^{1\times m}, $$

where $Z^{[2]}_k$ denotes the $k$th row of $Z^{[2]}$, the exponential is applied elementwise, and the division is broadcast across the rows, so that each column of $A^{[2]}$ sums to 1.
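In code, the columnwise softmax might look as follows. Subtracting the column maximum before exponentiating is a standard numerical-stability trick I add here (softmax is invariant to shifting each column by a constant); it is not part of the mathematical definition.

```python
import numpy as np

def softmax(Z2):
    """Z2 has shape (K, m); each column is mapped to a probability vector that sums to 1."""
    Z_shift = Z2 - Z2.max(axis=0, keepdims=True)   # stability trick; softmax is shift-invariant
    E = np.exp(Z_shift)
    return E / E.sum(axis=0, keepdims=True)
```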

From an earlier post, we derived the cross-entropy loss function for softmax regression. For the $i$th training example, let $Z^{[2](i)} \in \mathbb{R}^{K}$ denote the $i$th column of $Z^{[2]}$, and let $Y^{(i)}$ denote the corresponding response, which takes one of the $K$ classes. The probability that the model assigns to the observed class is

$$\pi(Y^{(i)}) = \frac{e^{Z^{[2](i)}_{Y^{(i)}}}}{\sum_{k=1}^{K} e^{Z^{[2](i)}_{k}}}.$$

Note that the subscript in the numerator picks out the component corresponding to the observed class $Y^{(i)}$. Now we are ready to write the softmax loss function derived earlier in the neural network notation. Let $Y$ be the $(K\times m)$ matrix whose $i$th column is the one-hot encoding of $Y^{(i)}$; then

$$ \begin{aligned} L(A^{[2]}, Y) &= -\frac{1}{m} \sum_{i=1}^m \log \pi(Y^{(i)}) \\ &= -\frac{1}{m} \sum_{i,j} \log A^{[2]} * Y. \end{aligned} \tag{5} $$

From (5), we can easily obtain its derivative:

$$ dA^{[2]} = -\frac{1}{m} \frac{Y}{A^{[2]}}. $$

To calculate the elementwise derivative $d\text{softmax}(Z^{[2]}) / dZ^{[2]}$, it helps to consider its first component in a univariate setting:

$$ \begin{aligned} \frac{d}{dz_1} \left(\frac{e^{z_1}}{e^{z_1} + \cdots + e^{z_K}} \right) &= \frac{e^{z_1}(e^{z_1} + \cdots + e^{z_K}) - e^{2z_1}}{(e^{z_1} + \cdots + e^{z_K})^2} \\ &= \frac{e^{z_1} - e^{z_1} A_1}{e^{z_1} + \cdots + e^{z_K}} \\ &= A_1(1-A_1), \end{aligned} $$

where $A_1 = e^{z_1} / (e^{z_1} + \cdots + e^{z_K})$.

This gives the diagonal entries of the softmax Jacobian. Unlike the sigmoid case, however, the softmax couples its components: for $j \neq k$, one can check that $\frac{d A_j}{d z_k} = -A_j A_k$. Applying the chain rule with the full Jacobian, and using the fact that each column of the one-hot matrix $Y$ sums to 1, the terms combine into

$$ dZ^{[2]} = \frac{1}{m}\left[A^{[2]} - Y\right], $$

which agrees with the sigmoid case!
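A sketch of the softmax loss (5) and the resulting gradient, with $Y$ stored as a one-hot $(K\times m)$ matrix (the function name and the max-subtraction and `eps` safeguards are my own additions):

```python
import numpy as np

def softmax_loss_and_grad(Z2, Y, eps=1e-12):
    """Z2 is (K, m); Y is the one-hot (K, m) label matrix. Returns the loss (5) and dZ2."""
    m = Y.shape[1]
    E = np.exp(Z2 - Z2.max(axis=0, keepdims=True))  # numerically stable softmax
    A2 = E / E.sum(axis=0, keepdims=True)
    loss = -np.sum(Y * np.log(A2 + eps)) / m        # equation (5)
    dZ2 = (A2 - Y) / m                              # same final form as the sigmoid case
    return loss, dZ2
```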

Identity Activation for Regression

We conclude this post by addressing the easiest activation function: the identity function! In a regression problem, we take the output value as is, without needing to transform it into a probability. The commonly used L2 loss function can be written as

$$ \begin{aligned} L(A^{[2]}, Y) &= \frac{1}{m} (A^{[2]} - Y) (A^{[2]} - Y)\tr \\ &= \frac{1}{m} (A^{[2]}{A^{[2]}}\tr - 2A^{[2]} Y\tr + Y Y\tr) \end{aligned} $$

Hence, the derivative can be easily computed as

$$ \begin{aligned} dA^{[2]} &= \frac{2}{m} (A^{[2]} - Y) = dZ^{[2]}, \end{aligned} $$

since the identity activation gives $A^{[2]} = Z^{[2]}$.
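For completeness, a sketch of this L2 loss and its gradient (the function name is mine; the $1/m$ scaling follows the convention used throughout this post):

```python
import numpy as np

def l2_loss_and_grad(A2, Y):
    """A2 and Y are row arrays of shape (1, m); with identity activation, dZ2 = dA2."""
    m = Y.shape[1]
    loss = np.sum((A2 - Y) ** 2) / m
    dZ2 = 2.0 * (A2 - Y) / m
    return loss, dZ2
```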

This concludes the math behind artificial neural networks. Although deep learning practitioners do not need these details to be successful, implementing something from scratch provides an entirely different level of understanding.