*Modified*: July 4, 2018
$\newcommand{\bs}{\boldsymbol}$
$\newcommand{\argmin}[1]{\underset{\bs{#1}}{\text{arg min}}\,}$
$\newcommand{\argmax}[1]{\underset{\bs{#1}}{\text{arg max}}\,}$
$\newcommand{\tr}{^{\top}}$
$\newcommand{\norm}[1]{\left|\left|\,#1\,\right|\right|}$
$\newcommand{\given}{\,|\,}$

The goal of this post is to briefly explain the beautiful connection between the maximum likelihood estimator and the cost function. I was greatly inspired by the deep learning book, which offers many great insights. The maximum likelihood estimator (MLE), obtained by maximizing the likelihood function, essentially minimizes the difference between the empirical distribution, defined by the training set, and the model distribution. This difference can be formally defined by the `cross entropy` between distributions. In particular, the `mean squared error` is simply the cross entropy between the empirical distribution and a Gaussian model. This is remarkable. Let's get started!

We all know that the parameters of the least squares model $h_{\theta}(\bs{x}) = \bs{x}\tr \theta$ are derived by minimizing the *mean squared error*:

$$ MSE = \frac{1}{m} \sum_{i=1}^m(y^{(i)}- h_{\theta}(\bs{x}^{(i)}))^2 $$

The solution, given by the normal equations, is easily obtained by calculus. However, this error measure, or `cost function`, seems to be chosen more or less arbitrarily.
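As a quick sketch of the closed-form solution (on made-up synthetic data, with rows of `X` as examples), the normal equations $\bs{X}\tr\bs{X}\theta = \bs{X}\tr \bs{y}$ can be solved directly with NumPy:

```python
import numpy as np

# Synthetic data: y = x^T theta_true + small Gaussian noise (illustrative only)
rng = np.random.default_rng(0)
m, n = 100, 3
X = rng.normal(size=(m, n))
theta_true = np.array([2.0, -1.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=m)

# Normal equations: theta_hat solves (X^T X) theta = X^T y
theta_hat = np.linalg.solve(X.T @ X, X.T @ y)
```

With little noise, `theta_hat` lands close to `theta_true`, as expected from minimizing the MSE directly.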

In reality, this choice is far from arbitrary, as it can be justified by the maximum likelihood procedure. In order to apply maximum likelihood estimation, we must imbue our model with a probabilistic interpretation. Because of the ubiquitous effect of the Central Limit Theorem, it is very reasonable to assume that $p(y\given \bs{x})$ follows a Gaussian distribution. That is,

$$ p_{\theta}(y\given \bs{x}) = \frac{1}{\sqrt{2\pi}\sigma} \text{exp}\left(-\frac{(y-\bs{x}\tr\theta)^2}{2\sigma^2} \right) $$

(This is equivalent to assuming that $y=\bs{x}\tr \theta + \epsilon$, where $\epsilon$ follows a Gaussian distribution with mean 0.)

Given a set of training data $\{(\bs{x}^{(1)}, y^{(1)}), ..., (\bs{x}^{(m)}, y^{(m)})\}$, the log-likelihood is given by

$$ \begin{aligned} l(\theta) &=\sum_{i=1}^m \log p_{\theta}(y^{(i)}\given \bs{x}^{(i)}) \\ &= -m\log \sigma - \frac{m}{2}\log(2\pi) - \sum_{i=1}^m \frac{( y^{(i)}-{\bs{x}^{(i)}}\tr\theta )^2}{2\sigma^2} \end{aligned} $$

Immediately we see that maximizing the log-likelihood with respect to $\theta$ is equivalent to minimizing the mean squared error.
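A small numerical check (on made-up synthetic data, with $\sigma$ fixed at 1) confirms that the MSE and the Gaussian log-likelihood rank parameters the same way: the candidate with smaller MSE has larger log-likelihood.

```python
import numpy as np

# Synthetic regression data with unit-variance Gaussian noise
rng = np.random.default_rng(1)
m = 50
x = rng.normal(size=(m, 2))
y = x @ np.array([1.0, 2.0]) + rng.normal(size=m)
sigma = 1.0

def mse(theta):
    return np.mean((y - x @ theta) ** 2)

def log_likelihood(theta):
    # Gaussian log-likelihood from the expansion above
    r = y - x @ theta
    return (-m * np.log(sigma) - (m / 2) * np.log(2 * np.pi)
            - np.sum(r ** 2) / (2 * sigma ** 2))

# Two candidate parameter vectors: lower MSE <=> higher log-likelihood
theta_a, theta_b = np.array([1.0, 2.0]), np.array([0.0, 0.0])
assert mse(theta_a) < mse(theta_b)
assert log_likelihood(theta_a) > log_likelihood(theta_b)
```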

Statistics is concerned with estimating the unknown data-generating distribution $p(\bs{x})$ using the empirical distribution $\widehat{p}(\bs{x})$ of the observed data. We then posit a model distribution from a parametric family $p_{\text{model}}(\bs{x}; \theta)$ indexed by the parameter $\theta$.

Our choice of the parametric family determines the model's flexibility. In principle, we could make $p_{\text{model}}(\bs{x}; \theta)$ identical to the empirical distribution $\widehat{p}(\bs{x})$ if our model is flexible enough. However, more flexibility is not always desirable due to the bias-variance tradeoff. Given a fixed model, maximum likelihood estimation is the most popular technique for estimating the parameter $\theta$. By the i.i.d. assumption on the sample, we define

$$ \theta_{\text{MLE}} = \underset{\theta}{\text{arg max}} \sum_{i=1}^m \log p_{\text{model}}(\bs{x}^{(i)}; \theta). $$

The maximum is unaffected if we divide this quantity by $m$, so we can express the MLE in terms of an expectation:

$$ \theta_{\text{MLE}} = \underset{\theta}{\text{arg max}}\, E_{\bs{x}\sim \widehat{p}}\left[\log p_{\text{model}}(\bs{x}; \theta)\right], $$

where the expectation is taken with respect to the empirical distribution $\widehat{p}(\bs{x})$. Now we are ready to define the *KL divergence*.

*Definition.* The *KL divergence* between the empirical distribution $\widehat{p}$ and the model distribution $p_{\text{model}}$ is defined by

$$ D_{\text{KL}}(\widehat{p}, p_{\text{model}}) = E_{\bs{x}\sim \widehat{p}} \left[\log \widehat{p}(\bs{x})-\log p_{\text{model}}(\bs{x}; \theta) \right] $$

The KL divergence measures the dissimilarity between two distributions. To minimize it, note that the term $\log \widehat{p}(\bs{x})$ does not depend on the parameter $\theta$; hence we only need to minimize the `negative log-likelihood`:

$$ -E_{\bs{x}\sim \widehat{p}} \left[\log p_{\text{model}}(\bs{x}; \theta)\right]. $$

This is the same as maximizing the log-likelihood, which is the same as finding the maximum likelihood estimator!

It is remarkable that the following actions are equivalent:

- maximize the likelihood
- maximize the log-likelihood
- minimize the negative log-likelihood
- minimize the MSE (under the Gaussian noise model)
- minimize the KL divergence
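This equivalence can be checked numerically on a toy discrete distribution (a sketch with made-up probabilities): the KL divergence and the negative log-likelihood differ only by the entropy of $\widehat{p}$, which does not depend on the model, so both criteria rank models identically.

```python
import numpy as np

# Empirical distribution over 3 outcomes, and two candidate models
p_hat = np.array([0.5, 0.3, 0.2])
models = [np.array([0.5, 0.3, 0.2]), np.array([0.1, 0.1, 0.8])]

def kl(p, q):
    # D_KL(p, q) = E_p[log p - log q]
    return np.sum(p * (np.log(p) - np.log(q)))

def nll(p, q):
    # Negative log-likelihood: -E_p[log q]
    return -np.sum(p * np.log(q))

# KL and NLL differ by the model-independent entropy of p_hat
entropy = -np.sum(p_hat * np.log(p_hat))
for q in models:
    assert np.isclose(kl(p_hat, q), nll(p_hat, q) - entropy)

# Both criteria prefer the model that matches the empirical distribution
assert kl(p_hat, models[0]) < kl(p_hat, models[1])
assert nll(p_hat, models[0]) < nll(p_hat, models[1])
```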

We end this post by finding the parameters of the logistic regression model by minimizing the negative log-likelihood. Logistic regression uses the sigmoid function as its activation function. The parameters of the model are given by $\theta = (\bs{w}, b)$, where $\bs{w}$ is the weight vector and $b$ is the bias. Given a training example $\bs{x}^{(i)}$, the algorithm first computes the value of

$$ z^{(i)} = \bs{w}\tr \bs{x}^{(i)} +b $$

Next, it feeds this value into the sigmoid function:

$$ \widehat{y}^{(i)} = a^{(i)} = \sigma(z^{(i)}) $$

For any particular training example, the *likelihood* is given by

$$ p(y\given x) = a^y (1-a)^{(1-y)} $$

where $a = p(y=1\given \bs{x}) = \sigma(\bs{w}\tr \bs{x} +b)$.

Taking the log, we find the *log-likelihood* to be

$$ l(a, y) = y\log(a) + (1-y)\log(1-a) $$

Hence, for the training set $\{(\bs{x}^{(1)}, y^{(1)}), ..., (\bs{x}^{(m)}, y^{(m)})\}$, we define the loss function to be proportional to the *negative log-likelihood*:

$$ J(\theta) = \frac{1}{m} \sum_{i=1}^m \left[-y^{(i)}\log(a^{(i)}) - (1-y^{(i)})\log(1-a^{(i)})\right] $$

Why isn't the MSE used as the loss function? It turns out that the function

$$ MSE = \frac{1}{m} \sum_{i=1}^m (y^{(i)} - a^{(i)})^2 $$

is non-convex in the parameters. This means that there may be multiple local minima, which limits the effectiveness of the gradient descent algorithm.
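The non-convexity can be seen numerically in a minimal one-dimensional sketch (a single made-up example with $x = 1$, $y = 1$, $b = 0$, so the losses are functions of the weight $w$ alone): the squared error through a sigmoid has negative second differences somewhere, while the cross-entropy loss does not.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Single example with x = 1, y = 1: losses as functions of the weight w
w = np.linspace(-6, 6, 601)
mse_loss = (1 - sigmoid(w)) ** 2
ce_loss = -np.log(sigmoid(w))

# A convex function never has a negative second finite difference
d2_mse = np.diff(mse_loss, 2)
d2_ce = np.diff(ce_loss, 2)
assert d2_mse.min() < 0   # squared error through a sigmoid is non-convex
assert d2_ce.min() > 0    # cross-entropy stays convex
```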

For a single training example $(\bs{x}^{(i)}, y^{(i)})$, we can use the chain rule to find the gradient of the loss function:

$$ \frac{dJ}{d\bs{w}} = (a^{(i)}-y^{(i)})\bs{x}^{(i)} $$
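For completeness, the chain rule factors can be written out explicitly; multiplying them together recovers the gradient above:

$$ \begin{aligned} \frac{\partial J}{\partial a^{(i)}} &= -\frac{y^{(i)}}{a^{(i)}} + \frac{1-y^{(i)}}{1-a^{(i)}}, \qquad \frac{\partial a^{(i)}}{\partial z^{(i)}} = a^{(i)}\left(1-a^{(i)}\right), \qquad \frac{\partial z^{(i)}}{\partial \bs{w}} = \bs{x}^{(i)}, \\ \frac{dJ}{d\bs{w}} &= \left(-\frac{y^{(i)}}{a^{(i)}} + \frac{1-y^{(i)}}{1-a^{(i)}}\right) a^{(i)}\left(1-a^{(i)}\right) \bs{x}^{(i)} = \left(a^{(i)}-y^{(i)}\right)\bs{x}^{(i)} \end{aligned} $$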

For a full set of $m$ training examples, it is more convenient to use vectorized notation: let $\bs{X}$ be the matrix whose columns are the input vectors, and define

$$ \begin{aligned} \bs{b} &= [b,...,b] \\ \bs{z} &= [z^{(1)}, ..., z^{(m)}] \\ \bs{a} &= [a^{(1)},..., a^{(m)}] \\ \bs{y} &= [y^{(1)},..., y^{(m)}] \end{aligned} $$

be row vectors. Given a set of training data, our model now produces a row vector of probabilities:

$$ \bs{a} = \sigma(\bs{z}) = \sigma(\bs{w}\tr \bs{X} + \bs{b}) $$

To find the gradient of $J(\theta)$, we use backpropagation and obtain the vectorized solution

$$ \begin{aligned} \frac{dJ}{d\bs{w}} &= \frac{1}{m} \bs{X}(\bs{a}-\bs{y})\tr \\ \frac{dJ}{db} &= \frac{1}{m} \sum_{i=1}^m (a^{(i)} - y^{(i)}) \end{aligned} $$
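As a sanity check (a hypothetical sketch on made-up data, not part of the original notebook), the vectorized gradients can be compared against centered finite differences of the cost:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Small synthetic problem: columns of X are examples, as in the text
rng = np.random.default_rng(2)
n, m = 3, 20
X = rng.normal(size=(n, m))
y = rng.integers(0, 2, size=m)
w = rng.normal(size=(n, 1))
b = 0.1

def cost(w, b):
    a = sigmoid(w.T @ X + b).ravel()
    return -np.mean(y * np.log(a) + (1 - y) * np.log(1 - a))

# Vectorized gradients from the text
a = sigmoid(w.T @ X + b).ravel()
dw = X @ (a - y).reshape(-1, 1) / m
db = np.mean(a - y)

# Centered finite differences, coordinate by coordinate
eps = 1e-6
dw_num = np.zeros_like(w)
for j in range(n):
    e = np.zeros_like(w)
    e[j] = eps
    dw_num[j] = (cost(w + e, b) - cost(w - e, b)) / (2 * eps)
db_num = (cost(w, b + eps) - cost(w, b - eps)) / (2 * eps)

assert np.allclose(dw, dw_num, atol=1e-5)
assert abs(db - db_num) < 1e-5
```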

Next we can use *gradient descent*, continuously updating the parameters via

$$ \begin{aligned} \bs{w} &:= \bs{w}-\alpha \frac{dJ}{d\bs{w}} \\ b &:= b - \alpha \frac{dJ}{db} \end{aligned} $$

where $\alpha >0$ is our *learning rate*.

Here is its implementation in Python. We use the same spam vs. non-spam dataset as in the previous post.

In [1]:

```
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
```

In [2]:

```
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data", header=None)
```

In [3]:

```
from sklearn.model_selection import train_test_split
```

In [4]:

```
X_train_orig, X_test_orig, y_train_orig, y_test_orig = train_test_split(df.drop(57, axis=1), df[57], test_size=0.33)
```

In [5]:

```
m_train = X_train_orig.shape[0]
m_test = X_test_orig.shape[0]
print ("Number of training examples: m_train = " + str(m_train))
print ("Number of testing examples: m_test = " + str(m_test))
```

In [6]:

```
# standardize dataset
from sklearn import preprocessing
X_train = preprocessing.scale(X_train_orig).T
X_test = preprocessing.scale(X_test_orig).T
y_train = np.array(y_train_orig).T
y_test = np.array(y_test_orig).T
```

In [7]:

```
# Define the sigmoid function
def sigmoid(z):
    s = 1 / (1 + np.exp(-z))
    return s
```

In [8]:

```
# Define the initial state of gradient descent
def initialize_with_zeros(dim):
    w = np.zeros(shape=(dim, 1))
    b = 0
    assert(w.shape == (dim, 1))
    assert(isinstance(b, float) or isinstance(b, int))
    return w, b
```

In [9]:

```
# Define forward and back propagation
def propagate(w, b, X, Y):
    m = X.shape[1]
    # Forward propagation: activations and cross-entropy cost
    A = sigmoid(np.dot(w.T, X) + b)
    cost = (-1 / m) * np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    # Backward propagation: gradients of the cost
    dw = (1 / m) * np.dot(X, (A - Y).T)
    db = (1 / m) * np.sum(A - Y)
    assert(dw.shape == w.shape)
    assert(db.dtype == float)
    cost = np.squeeze(cost)
    assert(cost.shape == ())
    grads = {"dw": dw, "db": db}
    return grads, cost
```

In [10]:

```
# Define the optimization function
def optimize(w, b, X, Y, num_iterations, learning_rate, print_cost=False):
    costs = []
    for i in range(num_iterations):
        grads, cost = propagate(w, b, X, Y)
        dw = grads["dw"]
        db = grads["db"]
        # Gradient descent update
        w = w - learning_rate * dw
        b = b - learning_rate * db
        # Record (and optionally print) the cost every 100 iterations
        if i % 100 == 0:
            costs.append(cost)
            if print_cost:
                print("Cost after iteration %i: %f" % (i, cost))
    params = {"w": w, "b": b}
    grads = {"dw": dw, "db": db}
    return params, grads, costs
```

In [11]:

```
# Define the prediction function
def predict(w, b, X):
    m = X.shape[1]
    Y_prediction = np.zeros((1, m))
    w = w.reshape(X.shape[0], 1)
    A = sigmoid(np.dot(w.T, X) + b)
    # Threshold the predicted probabilities at 0.5
    for i in range(A.shape[1]):
        Y_prediction[0, i] = 1 if A[0, i] > 0.5 else 0
    assert(Y_prediction.shape == (1, m))
    return Y_prediction
```

In [12]:

```
# Define the final logistic regression model
def model(X_train, Y_train, X_test, Y_test, num_iterations=2000, learning_rate=0.5, print_cost=False):
    w, b = initialize_with_zeros(X_train.shape[0])
    # Gradient descent
    parameters, grads, costs = optimize(w, b, X_train, Y_train, num_iterations, learning_rate, print_cost)
    w = parameters["w"]
    b = parameters["b"]
    # Predict test/train set examples
    Y_prediction_test = predict(w, b, X_test)
    Y_prediction_train = predict(w, b, X_train)
    # Print train/test errors
    print("train accuracy: {} %".format(100 - np.mean(np.abs(Y_prediction_train - Y_train)) * 100))
    print("test accuracy: {} %".format(100 - np.mean(np.abs(Y_prediction_test - Y_test)) * 100))
    d = {"costs": costs,
         "Y_prediction_test": Y_prediction_test,
         "Y_prediction_train": Y_prediction_train,
         "w": w,
         "b": b,
         "learning_rate": learning_rate,
         "num_iterations": num_iterations}
    return d
```

In [13]:

```
d = model(X_train, y_train, X_test, y_test, num_iterations = 2000, learning_rate = 0.005, print_cost = True)
```