Understanding Model Evaluations

Elan Ding
Modified: June 17, 2018

If you understand basic statistics, then you have probably heard of Type I error and Type II error before. Here is an excellent illustration: $\newcommand{\bs}{\boldsymbol}$ $\newcommand{\argmin}[1]{\underset{\bs{#1}}{\text{arg min}}\,}$ $\newcommand{\argmax}[1]{\underset{\bs{#1}}{\text{arg max}}\,}$ $\newcommand{\tr}{^{\top}}$ $\newcommand{\norm}[1]{\left|\left|\,#1\,\right|\right|}$

We encounter two new terms in the picture: False Positive (FP) and False Negative (FN). It gets better! All of the following terms are used to describe essentially the same ideas in model evaluation:

  • $\alpha, \beta$
  • type I error, type II error
  • false positive, false negative
  • power, sensitivity, specificity
  • precision, recall

This is overwhelming at first, but in reality they all revolve around the same underlying concept, which I will explain in this post.

Change of perspective

Classification modeling is nothing more than hypothesis testing, where the classification method is determined by the rejection region of the hypothesis test. So, let's quickly review hypothesis testing.

Definition. A hypothesis test is a statement:

$$ \begin{aligned} H_0 &: \theta \in \Theta_0 \\ H_1 &: \theta \in \Theta_0^c \end{aligned} $$

where $H_0$ is the null hypothesis, and $H_1$ is the alternative hypothesis. The set $\Theta_0$ is the null space, and the set $\Theta_0^c$ is the alternative space. The subset $R$ of the sample space $\mathcal{X}$ of the random vector $\bs{X}$ is called the rejection region. The function

$$ \delta(\bs{X}) = I(\bs{X}\in R) $$

where $I(\cdots)$ is the indicator function, is called the test function.

In terms of classification, the parameter $\theta$ is the variable $Y$ we are trying to classify. Let's use logistic regression for binary classification as an example. In this case, the model seeks the values $(\beta_0, \bs{\beta})$ that maximize the likelihood function

$$ L(\beta_0, \bs{\beta}) = \prod_{i:Y_i=1} p(\bs{X}_i) \prod_{i:Y_i=0} (1-p(\bs{X}_i)) $$

where the vector $\bs{X}_i$ is the $i$th observed data point, the scalar $Y_i$ takes on values 0 or 1, and

$$ \begin{aligned} p(\bs{X}) &= E(Y\,|\, \bs{X}) \\ &= P(Y=1\,|\, \bs{X}) \\ &= \frac{e^{\beta_0 + \bs{\beta}\tr \bs{X}}}{1+e^{\beta_0 + \bs{\beta}\tr\bs{X}}} = 1-P(Y=0\,|\, \bs{X}). \end{aligned} $$

Note that the coefficients $(\beta_0, \bs{\beta})$ are unknown; they are the parameters we are trying to estimate. Based on the observed data $\{\bs{X}_1, \cdots, \bs{X}_n\}$, we can estimate them using the maximum likelihood estimator (MLE). Once we have the MLE $(\widehat{\beta}_0, \widehat{\bs{\beta}})$, we can estimate the probability for any given data point $\bs{X}$ by $\widehat{p}(\bs{X})$.

Next, we need to decide whether the associated $Y$ value is 0 or 1. We define a threshold $c$ such that if $\widehat{p}(\bs{X})>c$ we classify $Y$ as 1, and 0 otherwise. (This makes sense since $p(\bs{X})$ represents the probability that $Y=1$ given $\bs{X}$.)
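To make this concrete, here is a minimal sketch using sklearn's LogisticRegression on made-up data (the data and variable names are purely illustrative). The fitted coefficients play the role of the MLE $(\widehat{\beta}_0, \widehat{\bs{\beta}})$, predict_proba returns $\widehat{p}(\bs{X})$, and we apply the threshold $c$ ourselves. Note that sklearn adds a mild regularization penalty by default, so strictly speaking it maximizes a penalized likelihood.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up training data: 100 observations with 2 features
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + 0.3 * rng.normal(size=100) > 0).astype(int)

# Fitting maximizes the (penalized) likelihood L(beta_0, beta)
model = LogisticRegression().fit(X, y)

# predict_proba returns P(Y=0 | X) and P(Y=1 | X); the second column is p_hat(X)
p_hat = model.predict_proba(X)[:, 1]

# Classify Y as 1 whenever p_hat(X) > c (c = 0.5 is what predict() uses)
c = 0.5
y_hat = (p_hat > c).astype(int)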

If the test set has $n$ elements, classifying these elements is equivalent to running $n$ hypothesis tests, where the $i$th one is of the form

$$ \begin{aligned} H_0 &: Y_i = 0 \\ H_1 &: Y_i = 1 \end{aligned} $$

Since this is a simple hypothesis test, we can invoke the Neyman-Pearson Lemma, which states that for this type of hypothesis test we can find the Uniformly Most Powerful (UMP) size-$\alpha$ test (i.e. the test that has the smallest false negative error rate among all tests whose false positive error rate is at most $\alpha$). Don't worry if these terms don't make sense now; they will become clear later. Basically, the best test has the test function

$$ \delta(\bs{X}) = I\left(\frac{p(\bs{X})}{1-p(\bs{X})}>k\right). $$

Using a little algebra, with a different threshold $c$, we can show that this test function is equivalent to

$$ \delta(\bs{X}) = I(p(\bs{X}) >c). $$
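To spell out the algebra: since $t \mapsto t/(1-t)$ is strictly increasing on $(0,1)$, the two rejection rules match up when $c = k/(1+k)$:

$$ \frac{p(\bs{X})}{1-p(\bs{X})} > k \iff p(\bs{X}) > \frac{k}{1+k} = c. $$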

(As a side note, if we take the log of the ratio $p(\bs{X})/(1-p(\bs{X}))$ in the original test function, and note that log is a monotone function, we get a rejection region equivalent to

$$\beta_0 + \bs{\beta}\tr \bs{X} > \ln k.$$

This is a linear function! This is why logistic regression, like linear discriminant analysis, has a linear decision boundary. Neat, right?)

In most cases, we choose $c$ to be 0.5 since this makes the most intuitive sense. By varying the values of $c$, we are changing the ability of the test to make a rejection.
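As a quick illustration of what varying $c$ does (the numbers below are made up purely to show the mechanics), suppose we have estimated probabilities for six test points and sweep the threshold; raising $c$ makes the test more reluctant to reject, so false positives go down while false negatives go up.

import numpy as np

# Hypothetical estimated probabilities p_hat(X_i) and true labels (illustrative only)
p_hat = np.array([0.9, 0.7, 0.55, 0.4, 0.3, 0.1])
truth = np.array([1,   1,   0,    1,   0,   0])

for c in [0.25, 0.5, 0.75]:
    y_hat = (p_hat > c).astype(int)
    fp = np.sum((y_hat == 1) & (truth == 0))  # false positives (Type I errors)
    fn = np.sum((y_hat == 0) & (truth == 1))  # false negatives (Type II errors)
    print("c = %.2f: rejections = %d, FP = %d, FN = %d" % (c, y_hat.sum(), fp, fn))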

The power function

We are now ready to define Type I error and Type II error.

Definition. The Type I error is defined by

$$ P_{\theta}(\bs{X}\in R ), \quad \text{ for } \theta \in \Theta_0, $$

and the Type II error is defined by

$$ P_{\theta}(\bs{X}\in R^c), \quad \text{ for } \theta \in \Theta_0^c. $$

In words, the Type I error is the probability of falsely rejecting the null hypothesis, and the Type II error is the probability of failing to reject a false null hypothesis. In this regard, the Type I error is the same as a False Positive (FP, or $\alpha$) (think of the word "positive" as making a rejection), and the Type II error is the same as a False Negative (FN, or $\beta$).

When I was in kindergarten (joking), I was always under the impression that $\alpha + \beta=1$. Intuitively, as we make the model more "strict" by reducing its ability to make rejections, the false positive rate ($\alpha$) decreases and the model makes more false negative claims (the false negative rate $\beta$ increases). This inverse relationship does hold in most relevant cases. However, $\alpha + \beta$ is not necessarily equal to 1.

Next, let's define the power function.

Definition. The power function of a test measures its ability to make a rejection. It is defined as:

$$ \beta(\theta) = E_{\theta}[\delta(\bs{X})] = P_{\theta}(\bs{X}\in R). $$

From Dr. Chris McMahan's excellent MATH 8040 notes, one example I particularly like is the following.

Example. Let $X\sim \text{Binomial}(5,\theta)$, and consider the hypothesis test

$$ \begin{aligned} H_0 &: \theta \leq 1/2 \\ H_1 &: \theta > 1/2 \end{aligned} $$

with the two rejection regions:

$$ \begin{aligned} R_1 &= \{x\,: \, x=5\}\\ R_2 &= \{x\,: \, x\geq 3\}. \end{aligned} $$

The power functions of these two rejection regions are

$$ \begin{aligned} \beta_1(\theta) &= P_{\theta}(X=5) = \theta^5 \\ \beta_2(\theta) &= P_{\theta}(X\geq 3) = 10\theta^3 (1-\theta)^2 + 5\theta^4 (1-\theta) + \theta^5 \end{aligned} $$
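(If you want to reproduce the two power curves yourself, a sketch along the following lines should work; matplotlib is assumed, and the colors simply match the plot below.)

import numpy as np
import matplotlib.pyplot as plt

theta = np.linspace(0, 1, 200)

# Power functions of the two rejection regions for X ~ Binomial(5, theta)
beta1 = theta**5                                                    # P(X = 5)
beta2 = 10*theta**3*(1-theta)**2 + 5*theta**4*(1-theta) + theta**5  # P(X >= 3)

plt.plot(theta, beta1, 'k', label=r'$R_1 = \{x : x = 5\}$')
plt.plot(theta, beta2, 'r', label=r'$R_2 = \{x : x \geq 3\}$')
plt.axvline(0.5, color='gray', linestyle='--')  # boundary between the null and alternative spaces
plt.xlabel(r'$\theta$')
plt.ylabel('power')
plt.legend()
plt.show()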

The following plot from Dr. McMahan's notes is the best plot I have ever seen in terms of explaining the concept of model evaluation:

From the plot, we see that the red curve ($R_2$) is strictly above the black curve ($R_1$). The black curve has a lower probability of making a rejection than the red curve. This makes sense since it is a lot harder to land in $R_1$ than in $R_2$.

If $\theta \in \Theta_0$, the red curve has a higher FP rate than the black curve. However, when $\theta \in \Theta_0^c$, the red curve has a lower FN rate than the black curve. Equivalently, we can say that the red curve has a higher sensitivity (or power) and a lower specificity than the black curve. I made some additions to the plot:

The more sensitive a model is, the less likely it is to overlook a false statement. Think of the hypothesis test as a judgemental person: if the person is very sensitive, she is easily irritated or offended. As soon as a wrong word comes out of your mouth, you will be rejected! :-) Additionally, we also refer to sensitivity as the power of a test.

On the other hand, the more specific a model is, the less likely it is to reject a true statement. I am unable to think of an equally good analogy here in terms of word choice. I remember it this way: the more specific a description is, the better it is at verifying a true statement.

Let's summarize this section with a diagram:

In sklearn, we can use the confusion_matrix function to easily obtain this table.

In [1]:
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1]  # true labels
y_pred = [0, 0, 1, 0, 0, 1]  # predicted labels

# sklearn returns rows = true class and columns = predicted class;
# transposing makes rows = predicted class to match the diagram above
confusion_matrix(y_true, y_pred).transpose()
Out[1]:
array([[2, 2],
       [0, 2]])

Note that the original confusion matrix is the transpose of the diagram I used above; hence I transposed it to make it match the diagram. We see that there are no false positive predictions. This is correct, since out of the two 1 predictions, both true values are also 1. On the other hand, there are 2 false negative predictions: out of the four 0 predictions, two of them are incorrect.
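Continuing with the same y_true and y_pred, the individual counts can also be unpacked directly; for binary labels, sklearn orders the raveled (untransposed) matrix as TN, FP, FN, TP.

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)  # 2 0 2 2 for the example above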

Precision and recall

Another handy function from the sklearn library is classification_report.

In [2]:
from sklearn.metrics import classification_report

target_names = ['negative', 'positive']

print(classification_report(y_true, y_pred, target_names=target_names))
             precision    recall  f1-score   support

   negative       0.50      1.00      0.67         2
   positive       1.00      0.50      0.67         4

avg / total       0.83      0.67      0.67         6

Here we encounter several new terms, in particular precision and recall. A quick search on Wikipedia yields the following excellent illustration:

In short, the term recall is the same as sensitivity. The term precision is similar to sensitivity in the sense that both increase with the number of true positives (correct rejections). The only difference is the denominator: precision is computed relative to the total number of rejections (predicted positives), whereas sensitivity is computed relative to the number of points that are truly in $\Theta_0^c$ (the relevant items).
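In terms of the confusion matrix counts, the difference is just a difference of denominators:

$$ \text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \text{sensitivity} = \frac{TP}{TP + FN}. $$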

Returning to the previous example, where

$$ \begin{aligned} \text{truth} &= \{1, 0, 1, 1, 0, 1\} \\ \text{prediction} &= \{0, 0, 1, 0, 0, 1\} \end{aligned} $$

The second row of the classification report gives a precision of 1 and a recall of 0.5. This agrees with our previous analysis. Previously, we found that there are no false positives. In other words, out of the two points that we predicted to be positive, both are correct predictions; hence, we have 100% precision. However, out of the four points that are truly positive, we are only able to identify two of them; hence, we have 50% recall (or sensitivity, or power).
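We can double-check these two numbers directly with sklearn's precision_score and recall_score, which by default report the scores for the positive class:

from sklearn.metrics import precision_score, recall_score

precision_score(y_true, y_pred)  # 1.0
recall_score(y_true, y_pred)     # 0.5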

In general we want a balance between precision and recall!
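The f1-score column in the report above is one common way to strike that balance: it is the harmonic mean of precision and recall,

$$ F_1 = 2\cdot\frac{\text{precision}\cdot\text{recall}}{\text{precision}+\text{recall}}, $$

and it is also available directly as sklearn's f1_score.

from sklearn.metrics import f1_score

f1_score(y_true, y_pred)  # 2 * (1.0 * 0.5) / (1.0 + 0.5) = 0.67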