The Cross Entropy and Maximum Likelihood

Elan Ding

Modified: July 4, 2018 $\newcommand{\bs}{\boldsymbol}$ $\newcommand{\argmin}[1]{\underset{\bs{#1}}{\text{arg min}}\,}$ $\newcommand{\argmax}[1]{\underset{\bs{#1}}{\text{arg max}}\,}$ $\newcommand{\tr}{^{\top}}$ $\newcommand{\norm}[1]{\left|\left|\,#1\,\right|\right|}$ $\newcommand{\given}{\,|\,}$

The goal of this post is to briefly explain the beautiful connection between the maximum likelihood estimator and the cost function. I was greatly inspired by the deep learning book, which offers many great insights. The maximum likelihood estimator (MLE), obtained by maximizing the likelihood function, essentially minimizes the difference between the empirical distribution, defined by the training set, and the model distribution. This difference can be formalized as the cross entropy between the two distributions. In particular, minimizing the mean squared error is equivalent to minimizing the cross entropy between the empirical distribution and a Gaussian model. This is remarkable. Let's get started!

Least Squares as Maximum Likelihood

We all know that the parameters of the least squares model $h_{\theta}(\bs{x}) = \bs{x}\tr \theta$ are derived by minimizing the mean squared error:

$$ MSE = \frac{1}{m} \sum_{i=1}^m(y^{(i)}- h_{\theta}(\bs{x}^{(i)}))^2 $$

The solution is easily obtained by calculus and is given by the normal equations.
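For reference, writing $\bs{X}$ for the design matrix whose rows are the ${\bs{x}^{(i)}}\tr$ and $\bs{y}$ for the vector of targets (a convention used only for this formula; the logistic regression section below stacks examples as columns), setting the gradient of the MSE to zero gives

$$ \bs{X}\tr\bs{X}\,\theta = \bs{X}\tr\bs{y} \quad\Longrightarrow\quad \theta_{\text{LS}} = (\bs{X}\tr\bs{X})^{-1}\bs{X}\tr\bs{y}, $$

assuming $\bs{X}\tr\bs{X}$ is invertible.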

At first glance, this error measure, or cost function, may seem to be chosen more or less arbitrarily. In reality, it is far from arbitrary, as it can be justified by the maximum likelihood procedure. In order to apply maximum likelihood estimation, we must imbue our model with a probabilistic interpretation. Because of the ubiquitous effect of the Central Limit Theorem, it is reasonable to assume that $p(y\given \bs{x})$ follows a Gaussian distribution. That is,

$$ p_{\theta}(y\given \bs{x}) = \frac{1}{\sqrt{2\pi}\sigma} \text{exp}\left(-\frac{(y-\bs{x}\tr\theta)^2}{2\sigma^2} \right) $$

(This is equivalent to assuming that $y=\bs{x}\tr \theta + \epsilon$, where $\epsilon$ follows a Gaussian distribution with mean $0$ and variance $\sigma^2$.)

Given a set of training data $\{(\bs{x}^{(1)}, y^{(1)}), ..., (\bs{x}^{(m)}, y^{(m)})\}$, the log-likelihood is given by

$$ \begin{aligned} l(\theta) &=\sum_{i=1}^m \log p_{\theta}(y^{(i)}\given \bs{x}^{(i)}) \\ &= -m\log \sigma - \frac{m}{2}\log(2\pi) - \sum_{i=1}^m \frac{( y^{(i)}-{\bs{x}^{(i)}}\tr\theta )^2}{2\sigma^2} \end{aligned} $$

Immediately we see that maximizing the log-likelihood with respect to $\theta$ is equivalent to minimizing the mean squared error.
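As a quick numerical sanity check, here is a minimal sketch (using synthetic data and a fixed $\sigma$, both assumed purely for illustration) showing that the least squares solution also maximizes the Gaussian log-likelihood:

import numpy as np

# Synthetic regression data (assumed for illustration only)
rng = np.random.default_rng(0)
m = 200
X = np.column_stack([np.ones(m), rng.normal(size=m)])   # design matrix with an intercept column
theta_true = np.array([1.0, 2.0])
y = X @ theta_true + rng.normal(scale=0.5, size=m)

# Least squares solution via the normal equations
theta_ls = np.linalg.solve(X.T @ X, X.T @ y)

def log_likelihood(theta, sigma=0.5):
    # Gaussian log-likelihood of y given X, with sigma held fixed
    resid = y - X @ theta
    return -m * np.log(sigma) - 0.5 * m * np.log(2 * np.pi) - np.sum(resid**2) / (2 * sigma**2)

# The least squares theta attains a higher log-likelihood than a nearby perturbation,
# since any perturbation increases the residual sum of squares
print(log_likelihood(theta_ls))
print(log_likelihood(theta_ls + np.array([0.1, -0.1])))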

Maximum Likelihood and KL Divergence

Statistics is concerned with estimating the unknown data-generating distribution $p(\bs{x})$ using the empirical distribution $\widehat{p}(\bs{x})$ defined by the observed data. We then propose a model distribution from a parametric family $p_{\text{model}}(\bs{x}; \theta)$ indexed by the parameter $\theta$.

Our choice of the parametric family determines the model's flexibility. Theoretically, we could make $p_{\text{model}}(\bs{x}; \theta)$ identical to the empirical distribution $\widehat{p}(\bs{x})$ if our model is flexible enough. However, more flexibility is not always desirable due to the bias-variance tradeoff. Given a fixed model, maximum likelihood estimation is the most popular technique for estimating the parameter $\theta$. Under the i.i.d. assumption on the sample, we define

$$ \theta_{\text{MLE}} = \underset{\theta}{\text{arg max}} \sum_{i=1}^m \log p_{\text{model}}(\bs{x}^{(i)}; \theta). $$

The maximizer is unaffected if we divide this quantity by $m$, so we can express the MLE in terms of an expectation:

$$ \theta_{\text{MLE}} = \underset{\theta}{\text{arg max}}\, E_{\bs{x}\sim \widehat{p}}\left[\log p_{\text{model}}(\bs{x}; \theta)\right]. $$

Here the expectation is taken with respect to the empirical distribution $\widehat{p}(\bs{x})$.
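As a toy illustration (a Bernoulli coin-flip model, assumed here purely as an example), the $\theta$ that maximizes this expected log-likelihood over a grid is the sample mean, exactly the familiar Bernoulli MLE:

import numpy as np

# Observed coin flips (toy data, assumed for illustration)
x = np.array([1, 0, 1, 1, 0, 1, 1, 0])
thetas = np.linspace(0.01, 0.99, 99)    # candidate parameter values

# E_{x ~ p_hat}[log p_model(x; theta)] for each candidate theta
avg_loglik = np.array([np.mean(x * np.log(t) + (1 - x) * np.log(1 - t)) for t in thetas])

print(thetas[np.argmax(avg_loglik)])    # roughly 0.62-0.63, the closest grid points to the MLE
print(x.mean())                         # 0.625, the Bernoulli MLE

With this view of the MLE as an expected log-likelihood under $\widehat{p}$, we are ready to define the KL divergence.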

Definition. The KL divergence between the empirical distribution $\widehat{p}$ and the model distribution $p_{\text{model}}$ is defined by

$$ D_{\text{KL}}(\widehat{p}, p_{\text{model}}) = E_{\bs{x}\sim \widehat{p}} \left[\log \widehat{p}(\bs{x})-\log p_{\text{model}}(\bs{x}; \theta) \right] $$

The KL divergence measures the dissimilarity between two distributions. To minimize it with respect to $\theta$, note that the term $\log \widehat{p}(\bs{x})$ does not depend on $\theta$, so we only need to minimize the cross entropy between $\widehat{p}$ and $p_{\text{model}}$, which is exactly the expected negative log-likelihood:

$$ -E_{\bs{x}\sim \widehat{p}} \left[\log p_{\text{model}}(\bs{x}; \theta)\right]. $$

This is the same as maximizing the log-likelihood, which is the same as finding the maximum likelihood estimator!
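A small numeric sketch (using toy discrete distributions, assumed only for illustration) makes the decomposition concrete: the KL divergence and the cross entropy differ by the entropy of $\widehat{p}$, which does not involve the model, so both are minimized by the same $\theta$.

import numpy as np

# Toy discrete distributions (assumed for illustration)
p_hat = np.array([0.5, 0.3, 0.2])    # "empirical" distribution
q = np.array([0.4, 0.4, 0.2])        # a candidate model distribution

kl = np.sum(p_hat * (np.log(p_hat) - np.log(q)))    # KL divergence D_KL(p_hat, q)
cross_entropy = -np.sum(p_hat * np.log(q))          # cross entropy H(p_hat, q)
entropy = -np.sum(p_hat * np.log(p_hat))            # entropy H(p_hat), independent of q

print(np.isclose(kl, cross_entropy - entropy))      # True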

It is remarkable that the following actions are equivalent:

  • maximize the likelihood
  • maximize the log-likelihood
  • minimize the negative log-likelihood
  • minimize the MSE (for the Gaussian model above)
  • minimize the KL divergence

Logistic Regression with Gradient Descent

We end this post by finding the parameters of the logistic regression model by minimizing the negative log-likelihood. Logistic regression uses the sigmoid function as its activation function. The parameters of the model are given by $\theta = (\bs{w}, b)$, where $\bs{w}$ is the weight vector and $b$ is the bias. Given a training example $\bs{x}^{(i)}$, the algorithm first computes the value of

$$ z^{(i)} = \bs{w}\tr \bs{x}^{(i)} +b $$

Next, it feeds this value into the sigmoid function:

$$ \widehat{y}^{(i)} = a^{(i)} = \sigma(z^{(i)}) $$

For any particular training example, the likelihood is given by

$$ p(y\given x) = a^y (1-a)^{(1-y)} $$

where $a = p(y=1\given x) = \sigma(\bs{w}\tr \bs{x} +b)$

Taking the log, we find the log-likelihood to be

$$ l(a, y) = y\log(a) + (1-y)\log(1-a) $$

Hence, for the training set $\{(\bs{x}^{(1)}, y^{(1)}), ..., (\bs{x}^{(m)}, y^{(m)})\}$, we define the loss function to be proportional to the negative log-likelihood:

$$ J(\theta) = \frac{1}{m} \sum_{i=1}^m \left[-y^{(i)}\log(a^{(i)}) - (1-y^{(i)})\log(1-a^{(i)})\right] $$

Why isn't the MSE used as the loss function? It turns out that the function

$$ MSE = \frac{1}{m} \sum_{i=1}^m (y^{(i)} - a^{(i)})^2 $$

is non-convex as a function of the parameters $(\bs{w}, b)$. This means that there may be multiple local minima, which limits the effectiveness of the gradient descent algorithm.
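Here is a rough finite-difference check (for a single example with $y = 1$, viewing each loss as a function of $z$; assumptions made purely for illustration) showing that the squared error is not convex, while the cross-entropy loss is:

import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Losses for a single example with y = 1, viewed as functions of z (illustration only)
def mse_loss(z):
    return (1 - sigmoid(z)) ** 2

def cross_entropy_loss(z):
    return -np.log(sigmoid(z))

def second_derivative(f, z, h=1e-3):
    # Central finite-difference approximation of f''(z)
    return (f(z + h) - 2 * f(z) + f(z - h)) / h ** 2

# The squared error's second derivative changes sign, so it is not convex in z
print(second_derivative(mse_loss, -3.0), second_derivative(mse_loss, 1.0))

# The cross-entropy loss has a positive second derivative at both points (it is convex everywhere)
print(second_derivative(cross_entropy_loss, -3.0), second_derivative(cross_entropy_loss, 1.0))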

For a single training example $(\bs{x}^{(i)}, y^{(i)})$, we can use the chain rule to find the gradient of the loss with respect to $\bs{w}$:

$$ \frac{dJ}{d\bs{w}} = (a^{(i)}-y^{(i)})\bs{x}^{(i)} $$
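To spell out the chain rule (dropping the superscript $(i)$ for brevity, and writing the single-example loss as $-l(a, y)$): since $\partial z/\partial \bs{w} = \bs{x}$ and $\sigma'(z) = \sigma(z)(1-\sigma(z))$,

$$ \begin{aligned} \frac{\partial (-l)}{\partial a} &= -\frac{y}{a} + \frac{1-y}{1-a}, \qquad \frac{\partial a}{\partial z} = a(1-a), \\ \frac{\partial (-l)}{\partial \bs{w}} &= \left(-\frac{y}{a} + \frac{1-y}{1-a}\right) a(1-a)\, \bs{x} = (a - y)\, \bs{x}. \end{aligned} $$

The same computation with $\partial z/\partial b = 1$ gives $\partial(-l)/\partial b = a - y$.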

For the full set of $m$ training examples, it is more convenient to use vectorized notation. Letting $\bs{X}$ be the matrix whose columns are the input vectors $\bs{x}^{(i)}$, we define

$$ \begin{aligned} \bs{b} &= [b,...,b] \\ \bs{z} &= [z^{(1)}, ..., z^{(m)}] \\ \bs{a} &= [a^{(1)},..., a^{(m)}] \\ \bs{y} &= [y^{(1)},..., y^{(m)}] \end{aligned} $$

to be row vectors. Given the training data, our model now produces a row vector of probabilities:

$$ \bs{a} = \sigma(\bs{z}) = \sigma(\bs{w}\tr \bs{X} + \bs{b}) $$

To find the gradient of $J(\theta)$, we use backpropagation and obtain the vectorized expressions

$$ \begin{aligned} \frac{dJ}{d\bs{w}} &= \frac{1}{m} \bs{X}(\bs{a}-\bs{y})\tr \\ \frac{dJ}{db} &= \frac{1}{m} \sum_{i=1}^m (a^{(i)} - y^{(i)}) \end{aligned} $$
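A quick sketch (with random shapes, assumed only to check the algebra) confirms that the vectorized gradient agrees with a per-example loop:

import numpy as np

# Random toy problem: n features, m examples stacked as columns (illustration only)
rng = np.random.default_rng(1)
n, m = 4, 6
X = rng.normal(size=(n, m))
w = rng.normal(size=(n, 1))
b = 0.3
y = rng.integers(0, 2, size=(1, m))

a = 1 / (1 + np.exp(-(w.T @ X + b)))    # row vector of predicted probabilities, shape (1, m)

dw_vectorized = X @ (a - y).T / m       # (1/m) X (a - y)^T, shape (n, 1)
dw_loop = sum((a[0, i] - y[0, i]) * X[:, [i]] for i in range(m)) / m

print(np.allclose(dw_vectorized, dw_loop))    # True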

Next, we can run gradient descent by repeatedly updating the parameters via

$$ \begin{aligned} \bs{w} &:= \bs{w}-\alpha \frac{dJ}{d\bs{w}} \\ b &:= b - \alpha \frac{dJ}{db} \end{aligned} $$

where $\alpha >0$ is our learning rate.

Here is the implementation in Python. We use the same spam vs. non-spam dataset as in the previous post.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
%matplotlib inline
In [2]:
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data", header=None)
In [3]:
from sklearn.model_selection import train_test_split
In [4]:
X_train_orig, X_test_orig, y_train_orig, y_test_orig = train_test_split(df.drop(57, axis=1), df[57], test_size=0.33)
In [5]:
m_train = X_train_orig.shape[0]
m_test = X_test_orig.shape[0]

print ("Number of training examples: m_train = " + str(m_train))
print ("Number of testing examples: m_test = " + str(m_test))
Number of training examples: m_train = 3082
Number of testing examples: m_test = 1519
In [6]:
# standardize dataset
from sklearn import preprocessing

X_train = preprocessing.scale(X_train_orig).T
X_test = preprocessing.scale(X_test_orig).T
y_train = np.array(y_train_orig).T
y_test = np.array(y_test_orig).T
In [7]:
# Define sigmoid function
def sigmoid(z):

    s = 1 / (1 + np.exp(-z))

    return s
In [8]:
# Define the initial state of gradient descent
def initialize_with_zeros(dim):

    w = np.zeros(shape=(dim, 1))
    b = 0

    assert(w.shape == (dim, 1))
    assert(isinstance(b, float) or isinstance(b, int))

    return w, b
In [9]:
# Define forward and back propagation
def propagate(w, b, X, Y):

    m = X.shape[1]

    # Forward propagation: compute the activations and the cross-entropy cost
    A = sigmoid(np.dot(w.T, X) + b)
    cost = (- 1 / m) * np.sum(Y * np.log(A) + (1 - Y) * (np.log(1 - A)))

    # Backward propagation: gradients of the cost with respect to w and b
    dw = (1 / m) * np.dot(X, (A - Y).T)
    db = (1 / m) * np.sum(A - Y)

    assert(dw.shape == w.shape)
    assert(db.dtype == float)
    cost = np.squeeze(cost)
    assert(cost.shape == ())

    grads = {"dw": dw, "db": db}

    return grads, cost
In [10]:
# Define the optimization function
def optimize(w, b, X, Y, num_iterations, learning_rate, print_cost = False):

    costs = []

    for i in range(num_iterations):

        grads, cost = propagate(w, b, X, Y)

        dw = grads["dw"]
        db = grads["db"]

        w = w - learning_rate * dw
        b = b - learning_rate * db

        if i % 100 == 0:
            costs.append(cost)

        if print_cost and i % 100 == 0:
            print ("Cost after iteration %i: %f" % (i, cost))

    params = {"w": w, "b": b}

    grads = {"dw": dw, "db": db}

    return params, grads, costs
In [11]:
# Define the prediction function
def predict(w, b, X):

    m = X.shape[1]
    Y_prediction = np.zeros((1, m))
    w = w.reshape(X.shape[0], 1)

    A = sigmoid(np.dot(w.T, X) + b)

    for i in range(A.shape[1]):

        Y_prediction[0, i] = 1 if A[0, i] > 0.5 else 0

    assert(Y_prediction.shape == (1, m))

    return Y_prediction
In [12]:
# Define the final logistic regression model
def model(X_train, Y_train, X_test, Y_test, num_iterations=2000, learning_rate=0.5, print_cost=False):

    w, b = initialize_with_zeros(X_train.shape[0])

    # Gradient descent
    parameters, grads, costs = optimize(w, b, X_train, Y_train, num_iterations, learning_rate, print_cost)

    w = parameters["w"]
    b = parameters["b"]

    # Predict test/train set examples 
    Y_prediction_test = predict(w, b, X_test)
    Y_prediction_train = predict(w, b, X_train)

    # Print train/test Errors
    print("train accuracy: {} %".format(100 - np.mean(np.abs(Y_prediction_train - Y_train)) * 100))
    print("test accuracy: {} %".format(100 - np.mean(np.abs(Y_prediction_test - Y_test)) * 100))


    d = {"costs": costs,
         "Y_prediction_test": Y_prediction_test,
         "Y_prediction_train" : Y_prediction_train,
         "w" : w,
         "b" : b,
         "learning_rate" : learning_rate,
         "num_iterations": num_iterations}

    return d
In [13]:
d = model(X_train, y_train, X_test, y_test, num_iterations = 2000, learning_rate = 0.005, print_cost = True)
Cost after iteration 0: 0.693147
Cost after iteration 100: 0.540237
Cost after iteration 200: 0.466858
Cost after iteration 300: 0.423622
Cost after iteration 400: 0.394911
Cost after iteration 500: 0.374321
Cost after iteration 600: 0.358750
Cost after iteration 700: 0.346507
Cost after iteration 800: 0.336590
Cost after iteration 900: 0.328368
Cost after iteration 1000: 0.321420
Cost after iteration 1100: 0.315455
Cost after iteration 1200: 0.310267
Cost after iteration 1300: 0.305704
Cost after iteration 1400: 0.301650
Cost after iteration 1500: 0.298020
Cost after iteration 1600: 0.294744
Cost after iteration 1700: 0.291769
Cost after iteration 1800: 0.289052
Cost after iteration 1900: 0.286558
train accuracy: 91.07722258273849 %
test accuracy: 91.57340355497038 %