**Elan Ding**

*Modified*: June 17, 2018

If you understand basic statistics, then you have probably heard of `Type I error` and `Type II error` before. Here is an excellent illustration:
$\newcommand{\bs}{\boldsymbol}$
$\newcommand{\argmin}[1]{\underset{\bs{#1}}{\text{arg min}}\,}$
$\newcommand{\argmax}[1]{\underset{\bs{#1}}{\text{arg max}}\,}$
$\newcommand{\tr}{^{\top}}$
$\newcommand{\norm}[1]{\left|\left|\,#1\,\right|\right|}$

We encounter two new terms in the picture: `False Positive (FP)` and `False Negative (FN)`. It gets better! All of the following terms show up in model evaluation, and they describe closely related ideas:

- $\alpha, \beta$
- type I error, type II error
- false positive, false negative
- power, sensitivity, specificity
- precision, recall

This is overwhelming at first, but in reality they all stem from the same idea, which I will explain in this post.

Classification modeling is nothing more than hypothesis testing, where the classification method is determined by the rejection region of the hypothesis test. So, let's quickly review hypothesis testing.

Definition. A *hypothesis test* is a statement:

$$ \begin{aligned} H_0 &: \theta \in \Theta_0 \\ H_1 &: \theta \in \Theta_0^c \end{aligned} $$

where $H_0$ is the *null hypothesis*, and $H_1$ is the *alternative hypothesis*. The set $\Theta_0$ is the *null space*, and the set $\Theta_0^c$ is the *alternative space*. The subset $R$ of the sample space $\mathcal{X}$ of the random vector $\bs{X}$ is called the *rejection region*. The function

$$ \delta(\bs{X}) = I(\bs{X}\in R) $$

where $I(\cdots)$ is the indicator function, is called the *test function*.

In terms of classification, the parameter $\theta$ is the variable $Y$ we are trying to classify. Let's use logistic regression for binary classification as an example. In this case, the model seeks the values $(\beta_0, \bs{\beta})$ that maximize the likelihood function

$$ L(\beta_0, \bs{\beta}) = \prod_{i:Y_i=1} p(\bs{X}_i) \prod_{i:Y_i=0} (1-p(\bs{X}_i)) $$

where the vector $\bs{X}_i$ is the $i$th observation, the scalar $Y_i$ takes on values 0 or 1, and

$$ \begin{aligned} p(\bs{X}) &= E(Y\,|\, \bs{X}) \\ &= P(Y=1\,|\, \bs{X}) \\ &= \frac{e^{\beta_0 + \bs{\beta}\tr \bs{X}}}{1+e^{\beta_0 + \bs{\beta}\tr\bs{X}}} = 1-P(Y=0\,|\, \bs{X}). \end{aligned} $$

Note that $\theta$ is unknown. It is the parameter that we are trying to estimate. Based on the observed data $\{\bs{X}_1, \cdots, \bs{X}_n\}$, we can *estimate* it using the *maximum likelihood estimator* (MLE). After we have determined the MLE to be $\widehat{\theta}$, we can estimate the probability that $Y=1$ for any given data point $\bs{X}$ by $\widehat{p}(\bs{X})$.

Next, we need to decide whether the associated $Y$ value is 0 or 1. We define a *threshold* $c$ such that if $\widehat{p}(\bs{X})>c$ we classify $Y$ as 1, and 0 otherwise. (This makes sense since $p(\bs{X})$ represents the probability that $Y=1$ given $\bs{X}$.)
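Here is a minimal sketch of this decision rule in plain Python. The coefficients `beta0_hat` and `beta_hat` are made-up stand-ins for the MLE, and the helper names `predict_proba` and `classify` are hypothetical, chosen only for illustration.

```python
import math

def predict_proba(x, beta0, beta):
    """Logistic model: P(Y=1 | x) with linear score z = beta0 + beta . x."""
    z = beta0 + sum(b * xi for b, xi in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-z))

def classify(x, beta0, beta, c=0.5):
    """Classify Y as 1 when the estimated probability exceeds the threshold c."""
    return 1 if predict_proba(x, beta0, beta) > c else 0

# Hypothetical coefficients (assumed, not fitted to any data).
beta0_hat, beta_hat = -1.0, [2.0, 0.5]

points = [[0.0, 0.0], [1.0, 0.0], [1.0, 2.0]]
labels = [classify(x, beta0_hat, beta_hat) for x in points]
print(labels)
```

With these assumed coefficients the linear scores are $-1$, $1$ and $2$, so only the first point falls below the default threshold of 0.5.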

If the test set has $n$ elements, classifying these elements is equivalent to running $n$ hypothesis tests, where the $i$th one is of the form

$$ \begin{aligned} H_0 &: Y_i = 0 \\ H_1 &: Y_i = 1 \end{aligned} $$

Since both hypotheses are simple (each one specifies a single value), we can invoke the `Neyman-Pearson Lemma`, which states that for this type of hypothesis test, we can find the `Uniformly Most Powerful` (UMP) size-$\alpha$ test (i.e. the test that has the smallest false negative error rate given that the false positive error rate is $\alpha$). Don't worry if these terms don't make sense now. They will be made clear later. Basically, we can find the *best* test, and it has the test function

$$ \delta(\bs{X}) = I\left(\frac{p(\bs{X})}{1-p(\bs{X})}>k\right). $$

Using a little algebra, with a different *threshold* $c$, we can show that this test function is equivalent to

$$ \delta(\bs{X}) = I(p(\bs{X}) >c). $$
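The "little algebra" can be spelled out as follows, writing $p$ for $p(\bs{X})$ and assuming $k>0$ and $0<p<1$:

$$ \begin{aligned} \frac{p}{1-p} > k &\iff p > k(1-p) \\ &\iff p\,(1+k) > k \\ &\iff p > \frac{k}{1+k}. \end{aligned} $$

So the two thresholds are related by $c = k/(1+k)$. For instance, $k=1$ corresponds to $c=0.5$.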

(As a side note, if we take the log of the ratio $p(\bs{X})/(1-p(\bs{X}))$ in the original test function and note that log is a monotone function, we get a rejection region equivalent to

$$\beta_0 + \bs{\beta}\tr \bs{X} > \ln k.$$

This is a linear function of $\bs{X}$! This is why logistic regression (like linear discriminant analysis) has a linear `decision boundary`. Neat, right?)
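As a quick sanity check, the three forms of the rejection rule (odds ratio above $k$, probability above $k/(1+k)$, linear score above $\ln k$) can be compared numerically. This is only a sketch: the function names are hypothetical, and the grid of scores and thresholds is arbitrary and deliberately avoids exact ties, where floating-point rounding could matter.

```python
import math

def sigmoid(z):
    """Map a linear score z to the probability p = e^z / (1 + e^z)."""
    return 1.0 / (1.0 + math.exp(-z))

def rejects_by_odds(p, k):
    return p / (1.0 - p) > k

def rejects_by_probability(p, k):
    return p > k / (1.0 + k)

def rejects_by_score(z, k):
    return z > math.log(k)

# Arbitrary linear scores beta0 + beta^T x and thresholds k (assumed values).
scores = [-3.0, -1.5, -0.2, 0.3, 1.1, 2.5]
thresholds = [0.25, 1.0, 4.0]

all_agree = all(
    rejects_by_odds(sigmoid(z), k)
    == rejects_by_probability(sigmoid(z), k)
    == rejects_by_score(z, k)
    for z in scores
    for k in thresholds
)
print(all_agree)
```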

In most cases, we choose $c$ to be 0.5 since this makes the most intuitive sense. By varying the values of $c$, we are changing the ability of the test to make a rejection.
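To see the effect of moving $c$, here is a small sketch that counts the two kinds of mistakes at a few thresholds. The labels and estimated probabilities are invented for illustration, and `error_counts` is a hypothetical helper.

```python
def error_counts(y_true, p_hat, c):
    """Return (false positives, false negatives) when classifying with threshold c."""
    y_pred = [1 if p > c else 0 for p in p_hat]
    fp = sum(1 for t, yp in zip(y_true, y_pred) if t == 0 and yp == 1)
    fn = sum(1 for t, yp in zip(y_true, y_pred) if t == 1 and yp == 0)
    return fp, fn

# Made-up true labels and estimated probabilities.
y_true = [0, 0, 0, 1, 0, 1, 1, 1]
p_hat = [0.1, 0.3, 0.45, 0.4, 0.6, 0.55, 0.8, 0.9]

results = {c: error_counts(y_true, p_hat, c) for c in (0.2, 0.5, 0.7)}
for c, (fp, fn) in results.items():
    print(f"c = {c}: false positives = {fp}, false negatives = {fn}")
```

On this toy data, raising $c$ from 0.2 to 0.7 takes the false positives from 3 down to 0 while the false negatives go from 0 up to 2.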

We are now ready to define `Type I error` and `Type II error`.

Definition. The *Type I error* is defined by

$$ P_{\theta}(\bs{X}\in R ), \quad \text{ for } \theta \in \Theta_0, $$

and the *Type II error* is defined by

$$ P_{\theta}(\bs{X}\in R^c), \quad \text{ for } \theta \in \Theta_0^c. $$

In words, the **Type I error** is the probability of falsely rejecting the null hypothesis, and the **Type II error** is the probability of failing to reject a false null hypothesis. In this regard, the Type I error is the same as `False Positive` (FP or $\alpha$) (think of the word "positive" as making a rejection), and the Type II error is the same as `False Negative` (FN or $\beta$).

When I was in kindergarten (joking), I was always under the impression that $\alpha + \beta=1$. Intuitively, as we make the model more "strict" by reducing its ability to make rejections, the false positive rate ($\alpha$) will decrease and the model will make more false negative claims, so the false negative rate ($\beta$) will increase. This inverse relationship holds in most relevant cases. However, $\alpha + \beta$ is not necessarily equal to 1.

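Next, we define the power function.

Definition. The *power function* of a test measures its ability to make a rejection. It is defined as:

$$ \beta(\theta) = E_{\theta}[\delta(\bs{X})] = P_{\theta}(\bs{X}\in R). $$

Do not confuse the power function $\beta(\theta)$ with the Type II error rate $\beta$ above: for $\theta \in \Theta_0^c$, the Type II error equals $1-\beta(\theta)$.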

From Dr. Chris McMahan's excellent MATH 8040 notes, one example I particularly like is the following.

Example. Let $X\sim \text{Binomial}(5,\theta)$, and consider the hypothesis test

$$ \begin{aligned} H_0 &: \theta \leq 1/2 \\ H_1 &: \theta > 1/2 \end{aligned} $$

with the two rejection regions:

$$ \begin{aligned} R_1 &= \{x\,: \, x=5\}\\ R_2 &= \{x\,: \, x\geq 3\}. \end{aligned} $$

The power functions of these two rejection regions are

$$ \begin{aligned} \beta_1(\theta) &= P_{\theta}(X=5) = \theta^5 \\ \beta_2(\theta) &= P_{\theta}(X\geq 3) = 10\theta^3 (1-\theta)^2 + 5\theta^4 (1-\theta) + \theta^5 \end{aligned} $$
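These power functions are easy to evaluate directly from the binomial probability mass function. The sketch below is not taken from the course notes; the helper name `power` and the chosen values of $\theta$ are my own assumptions for illustration.

```python
from math import comb

def power(theta, region, n=5):
    """P_theta(X in region) for X ~ Binomial(n, theta)."""
    return sum(comb(n, x) * theta**x * (1 - theta) ** (n - x) for x in region)

R1 = [5]        # reject only when x = 5
R2 = [3, 4, 5]  # reject when x >= 3

for theta in (0.25, 0.5, 0.75):
    print(f"theta = {theta}: beta_1 = {power(theta, R1):.4f}, "
          f"beta_2 = {power(theta, R2):.4f}")
```

At the boundary $\theta = 1/2$ this gives $\beta_1 = 1/32$ and $\beta_2 = 1/2$, so the second rejection region rejects far more often even when the null hypothesis is true.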

The following plot from Dr. McMahan's notes is the best plot I have ever seen in terms of explaining the concept of model evaluation:

From the plot, we see that the red curve ($R_2$) is strictly above the black curve ($R_1$). The black curve has a lower probability of making a rejection than the red curve. This makes sense since it is a lot harder to land in $R_1$ than $R_2$.

If $\theta \in \Theta_0$, the red curve has a higher FP rate than the black curve. However, when $\theta \in \Theta_0^c$, the red curve has a *lower* FN rate than the black curve. Equivalently, we can say that the red curve has a higher **sensitivity** (or **power**) and a lower **specificity** than the black curve. I made some additions to the plot:

The more *sensitive* a model is, the less likely it is to overlook a false null hypothesis. Think of the hypothesis test as a judgemental person: if the person is very sensitive, she is easily irritated or offended. As soon as a wrong word comes out of your mouth, you will be rejected! :-) Additionally, we also refer to sensitivity as the *power* of a test.

On the other hand, the more *specific* a model is, the less likely it is to reject a true statement. I am unable to think of an equally good analogy here in terms of the choice of word. I remember it this way: The more specific the description is, the better it is at verifying a true statement.
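To put numbers on the binomial example, take the boundary value $\theta = 1/2$ for the null space and, as an arbitrary illustrative choice, $\theta = 0.8$ for the alternative space:

$$ \begin{aligned} \text{Type I error at } \theta = 0.5 &: \quad \beta_1(0.5) = \tfrac{1}{32} \approx 0.031, & \beta_2(0.5) &= \tfrac{16}{32} = 0.5 \\ \text{Type II error at } \theta = 0.8 &: \quad 1-\beta_1(0.8) = 1 - 0.8^5 \approx 0.672, & 1-\beta_2(0.8) &\approx 1 - 0.942 = 0.058 \end{aligned} $$

The region $R_2$ trades a much larger Type I error for a much smaller Type II error. Note also that the two errors sum to roughly 0.70 for $R_1$ and 0.56 for $R_2$, so they do not add up to 1.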

Let's summarize this section with a diagram:

In `sklearn`, we can use the `confusion_matrix` function to easily obtain this table.


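```
from sklearn.metrics import confusion_matrix

y_true = [1, 0, 1, 1, 0, 1]
y_pred = [0, 0, 1, 0, 0, 1]
# sklearn puts true labels on rows and predictions on columns,
# so transpose to put predictions on rows, as in the diagram.
confusion_matrix(y_true, y_pred).transpose()
```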


Note that the original confusion matrix is the transpose of the diagram I used above; hence I transposed it to make it match the diagram. We see that there are no false positive predictions. This is correct since for both of the two 1 predictions, the true values are also 1. On the other hand, there are 2 false negative predictions. Out of the four 0 predictions, two of them are incorrect.
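The same four counts can be tallied by hand in plain Python, which makes the bookkeeping explicit. This is just a sketch for checking the reasoning above; the variable names are my own.

```python
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [0, 0, 1, 0, 0, 1]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

# Rows are predictions (0, then 1); columns are true labels (0, then 1).
table = [[tn, fn],
         [fp, tp]]
print(table)
```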

Another handy function from the `sklearn` library is `classification_report`.


```
from sklearn.metrics import classification_report
target_names = ['negative', 'positive']
print(classification_report(y_true, y_pred, target_names=target_names))
```

Here we encounter several new terms. In particular, we see `precision` and `recall`. A quick search on Wikipedia yields the following excellent illustration:

In short, the term `recall` is the same as `sensitivity`. The term `precision` is similar to `sensitivity` in the sense that both are positively affected by the power of the test. The only difference is that `precision` is negatively affected by the total number of rejections, whereas `sensitivity` is negatively affected by the number of points that are in $\Theta_0^c$ (the relevant items).
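In terms of the confusion-matrix counts (true positives TP, false positives FP, false negatives FN, true negatives TN), the standard definitions are:

$$ \begin{aligned} \text{precision} &= \frac{\text{TP}}{\text{TP} + \text{FP}} \\ \text{recall} = \text{sensitivity} &= \frac{\text{TP}}{\text{TP} + \text{FN}} \\ \text{specificity} &= \frac{\text{TN}}{\text{TN} + \text{FP}} \end{aligned} $$

The denominator of precision is the total number of rejections, while the denominator of recall is the number of points that are truly positive.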

Returning to the previous example, where

$$ \begin{aligned} \text{truth} &= \{1, 0, 1, 1, 0, 1\} \\ \text{prediction} &= \{0, 0, 1, 0, 0, 1\} \end{aligned} $$

The second row of the classification report gives a precision of 1 and recall of 0.5. This agrees with our previous analysis. Previously, we found that there are no false positives. In other words, out of all those (2) points that we predict to be positive, both are correct predictions. Hence, we have 100% `precision`. However, out of all (4) points that are truly positive, we are only able to correctly identify 2 of them. Hence, we have 50% `recall` or `sensitivity` or `power`.
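These two numbers can be reproduced without `sklearn`. The sketch below assumes binary 0/1 labels; the function names `precision` and `recall` are hypothetical helpers, not library calls.

```python
def precision(y_true, y_pred):
    """Fraction of predicted positives that are truly positive."""
    predicted_pos = [t for t, p in zip(y_true, y_pred) if p == 1]
    return sum(predicted_pos) / len(predicted_pos)

def recall(y_true, y_pred):
    """Fraction of true positives that are predicted positive."""
    actual_pos = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(actual_pos) / len(actual_pos)

truth = [1, 0, 1, 1, 0, 1]
prediction = [0, 0, 1, 0, 0, 1]

print(precision(truth, prediction))  # 2 correct out of 2 predicted positives
print(recall(truth, prediction))     # 2 found out of 4 true positives
```

Note that both helpers would divide by zero if there were no predicted positives or no true positives, a case this sketch does not handle.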

In general we want a balance between `precision` and `recall`!
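One common way to summarize that balance in a single number (not discussed above, but reported by `classification_report` as `f1-score`) is the harmonic mean of the two:

$$ F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}. $$

For the positive class in our example, this gives $F_1 = \frac{2 \cdot 1 \cdot 0.5}{1 + 0.5} = \frac{2}{3} \approx 0.67$.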