Univariate Analysis with Binary Response I

June 21, 2020 $\newcommand{\bs}{\boldsymbol}$ $\newcommand{\argmin}[1]{\underset{\bs{#1}}{\text{arg min}}\,}$ $\newcommand{\argmax}[1]{\underset{\bs{#1}}{\text{arg max}}\,}$ $\newcommand{\tr}{^{\top}}$ $\newcommand{\norm}[1]{\left|\left|\,#1\,\right|\right|}$ $\newcommand{\given}{\,|\,}$ $\newcommand{\st}{\,\big|\,}$ $\newcommand{\E}[1]{\mathbb{E}\left[#1\right]}$ $\newcommand{\P}[1]{\mathbb{P}\left(#1\right)}$ $\newcommand{\abs}[1]{\left|#1\right|}$ $\newcommand{\blue}[1]{{\color{blue} {#1}}}$ $\newcommand{\red}[1]{{\color{red} {#1}}}$ $\newcommand{\orange}[1]{{\color{orange} {#1}}}$ $\newcommand{\pfrac}[2]{\frac{\partial #1}{\partial #2}}$

When dealing with structured data with a binary outcome (such as credit data), the most widely used model-evaluation metrics are the AUC (area under the curve), the Gini coefficient, and the KS (Kolmogorov-Smirnov) statistic. Although these metrics are primarily used for global benchmark comparisons, they can also provide alternative insights into the local properties of a model.

KS statistic

We start with definitions to construct a framework. Let $Y$ be a binary outcome such that

$$ Y = \begin{cases} 1 & \text{if the response is positive} \\ 0 & \text{if the response is negative.} \end{cases} $$

Let $S$ be another variable that is associated with $Y$. For instance, $S$ can be a score that gives the probability of a positive response; alternatively, $S$ can be another attribute that is predictive of $Y$. Given observations $Y_1,\dots,Y_N$ and the associated values $S_1,\dots,S_N$, we define the empirical cumulative distribution function (CDF) of the positive responses by

$$ F_1(s) = \frac{1}{n} \sum_{k=1}^N I(S_k \leq s, Y_k=1), \tag{1} $$

where $n=\sum_{k=1}^N I(Y_k=1)$ is the number of positive responses and $I(\cdot)$ is the indicator function. Similarly we define the empirical CDF of the negative responses by

$$ F_0(s) = \frac{1}{m} \sum_{k=1}^N I(S_k \leq s, Y_k=0), \tag{2} $$

where $m=\sum_{k=1}^N I(Y_k=0)$ is the number of negative responses. Clearly $N=n+m$.

Definition   The Kolmogorov-Smirnov (KS) statistic of $S$ with support $\mathcal{S}$ is defined as

$$ \kappa = \max_{s \in \mathcal{S}} \abs{F_0(s) - F_1(s)} \tag{3} $$

This definition makes sense intuitively: the smaller $\kappa$ is, the less discriminatory power the model possesses. Figure 1 illustrates the KS statistic geometrically. Given a cut-off point $s$, the difference $\abs{F_0(s)-F_1(s)}$ tells us how many more negative (or positive) responses we can discriminate using $s$ than by guessing randomly.

Figure 1 - KS statistic
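Here is a minimal NumPy sketch of (1)-(3); the helper name `ks_statistic` and the simulated scores are purely illustrative:

import numpy as np

def ks_statistic(s, y):
    """Empirical KS statistic of a score s with respect to a binary target y."""
    s, y = np.asarray(s, dtype=float), np.asarray(y, dtype=int)
    cutoffs = np.sort(np.unique(s))                 # candidate cut-off points
    n, m = y.sum(), (1 - y).sum()                   # numbers of positives and negatives
    F1 = np.array([np.sum((s <= c) & (y == 1)) for c in cutoffs]) / n   # equation (1)
    F0 = np.array([np.sum((s <= c) & (y == 0)) for c in cutoffs]) / m   # equation (2)
    return np.max(np.abs(F0 - F1))                  # definition (3)

# Toy example: positives tend to score higher than negatives
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
s = rng.normal(loc=y, scale=1.0)
print(ks_statistic(s, y))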

Lorenz Curve and Gini Index

The Lorenz curve (LC) is closely related to the receiver operating characteristic (ROC) curve and is defined parametrically by

$$ \begin{aligned} x &= F_0(s) \\ y &= F_1(s), \quad s \in \mathcal{S}. \end{aligned} \tag{4} $$
Figure 2 - Lorenz curve

In fact, the ROC curve is exactly the LC with both coordinates reversed ($x \mapsto 1-x$, $y \mapsto 1-y$). If we classify all examples as negative when $S\leq s$ and positive when $S > s$, then $1-F_1(s)$ and $1-F_0(s)$ are exactly the true positive and false positive rates, respectively. (Think about it. It's fun!)
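As a quick sanity check of this relationship, the sketch below uses simulated scores (purely illustrative) to plot the Lorenz curve $(F_0(s), F_1(s))$ next to the ROC curve $(1-F_0(s), 1-F_1(s))$:

import numpy as np
import matplotlib.pyplot as plt

# Simulated scores: positives tend to score higher than negatives
rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=2000)
s = rng.normal(loc=y, scale=1.0)

cutoffs = np.sort(np.unique(s))
F1 = np.array([np.mean(s[y == 1] <= c) for c in cutoffs])   # CDF of positives
F0 = np.array([np.mean(s[y == 0] <= c) for c in cutoffs])   # CDF of negatives

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))
ax1.plot(F0, F1); ax1.plot([0, 1], [0, 1], '--')
ax1.set_title('Lorenz curve'); ax1.set_xlabel('$F_0(s)$'); ax1.set_ylabel('$F_1(s)$')
ax2.plot(1 - F0, 1 - F1); ax2.plot([0, 1], [0, 1], '--')
ax2.set_title('ROC curve'); ax2.set_xlabel('false positive rate'); ax2.set_ylabel('true positive rate')
plt.show()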

Just like the AUC, the Gini index is defined geometrically, as follows.

Definition   The Gini index is defined geometrically from Figure 2, in which the areas satisfy $A+B=\tfrac{1}{2}$, as

$$ \mathcal{G} := \frac{A}{A+B} = 2A = 1-2B.\tag{5}$$

In the case when $F_0(s)$ and $F_1(s)$ are continuous functions, the area $B$ is the integral of $F_1$ with respect to $F_0$ along the parametric curve (4), so that

$$ \mathcal{G} = 1 - 2\int_{\mathcal{S}} F_1(s)\,\frac{dF_0(s)}{ds}\, ds. \tag{6} $$

Numerical methods can be used to approximate this area. Let's divide the support $\mathcal{S}$ into $r$ subintervals (or bins),

$$[q_0, q_1], (q_1, q_2],...,(q_{r-1}, q_r],$$

where $q_0$ and $q_r$ are the left and right boundary points of $\mathcal{S}$. Define

$$ \begin{aligned} Q_0 &:= (F_0(q_1),...,F_0(q_r))':= (F_{0,1}, ..., F_{0,r})' \\ Q_1 & := (F_1(q_1),...,F_1(q_r))':= (F_{1,1},...,F_{1,r})' \end{aligned} $$

If we have a sufficient amount of data, the components of $Q_0$ and $Q_1$ empirically trace out the Lorenz curve. The area $B$ in Figure 2 can be approximated by averaging the upper and lower Riemann sums of $F_1$ with respect to $F_0$: each sum consists of small rectangles whose width is $F_{0,k} - F_{0,k-1}$ and whose height is $F_{1,k-1}$ (lower sum) or $F_{1,k}$ (upper sum). With the convention $F_{0,0}=F_{1,0}:=0$, we have

$$ \mathcal{G} = 1 - \sum_{k=1}^r (F_{0,k} - F_{0,k-1})(F_{1,k} + F_{1,k-1}). \tag{7} $$
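Here is a short sketch of (7); the function name is illustrative, and `F0`, `F1` are assumed to hold the binned CDF values $F_{0,1},\dots,F_{0,r}$ and $F_{1,1},\dots,F_{1,r}$:

import numpy as np

def gini_index(F0, F1):
    """Trapezoidal approximation (7) of the Gini index from binned CDF values."""
    F0 = np.concatenate(([0.0], np.asarray(F0, dtype=float)))   # prepend F_{0,0} = 0
    F1 = np.concatenate(([0.0], np.asarray(F1, dtype=float)))   # prepend F_{1,0} = 0
    return 1.0 - np.sum(np.diff(F0) * (F1[1:] + F1[:-1]))

# Example with r = 4 bins (made-up CDF values, both ending at 1)
print(gini_index([0.4, 0.7, 0.9, 1.0], [0.1, 0.3, 0.6, 1.0]))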

A natural question one might ask is: "What is the right number of bins?" We will address that in the last section.

Density-based Metric and Information Value

Alternatively, let $f_0$ and $f_1$ be the probability density functions of $S$ restricted to the negative and positive responses, respectively. We can define the information value $\mathcal{I}$ based on the Kullback-Leibler divergence between these two distributions. In this post, I propose the following definition.

Definition   The information value denoted by $\mathcal{I}$ is defined by

$$ \mathcal{I} = D_{\text{KL}}(f_0 || f_1) + D_{\text{KL}}(f_1 || f_0) \tag{8}$$

where, for probability distributions $f$ and $g$, $D_{\text{KL}}(f \,||\, g)$ is the Kullback-Leibler divergence defined by

$$ D_{\text{KL}}(f || g) = \mathbb{E}_{f} \log \left(\frac{f}{g}\right) $$

In the case when both $f_0$ and $f_1$ are probability density functions, we can rewrite (8) as

$$ \begin{aligned} \mathcal{I} &= \int_\mathcal{S} f_0(s) \log\left(\frac{f_0(s)}{f_1(s)}\right)\, ds + \int_{\mathcal{S}} f_1(s) \log\left(\frac{f_1(s)}{f_0(s)}\right)\, ds \\ &= \int_{\mathcal{S}} \left(f_1(s) - f_0(s)\right) \log\left(\frac{f_1(s)}{f_0(s)} \right)\, ds. \end{aligned} \tag{9} $$

Using the same binning technique as in (7), we can approximate the integral in (9) by a finite sum of probability masses:

$$ \mathcal{I} = \sum_{k=1}^r \left((F_{1,k}-F_{1,k-1}) - (F_{0,k}-F_{0,k-1}) \right) \log\left(\frac{F_{1,k}-F_{1,k-1}}{F_{0,k}-F_{0,k-1}} \right). \tag{10} $$

Empirically, the difference of CDFs $F_{1,k}-F_{1,k-1}$ represents the proportion of positive responses that fall in the $k$th bin $(q_{k-1}, q_k]$, and similarly $F_{0,k}-F_{0,k-1}$ is the proportion of negative responses in $(q_{k-1}, q_k]$.
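A sketch of (10) in the same spirit, again assuming `F0` and `F1` hold the binned CDF values so that their first differences are the per-bin probability masses (every bin must contain both classes):

import numpy as np

def information_value(F0, F1):
    """Approximate the information value (10) from binned CDF values."""
    F0 = np.concatenate(([0.0], np.asarray(F0, dtype=float)))
    F1 = np.concatenate(([0.0], np.asarray(F1, dtype=float)))
    d0, d1 = np.diff(F0), np.diff(F1)               # per-bin masses of negatives / positives
    return np.sum((d1 - d0) * np.log(d1 / d0))

print(information_value([0.4, 0.7, 0.9, 1.0], [0.1, 0.3, 0.6, 1.0]))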

Tree-based Metric and Gini Impurity

It would be unjust to omit decision trees, one of the most efficient and explainable machine learning algorithms. You can read more about them in my paper here. To put things in context, we are interested in a binary tree split on the variable $S$. For a given split value $s$, let $L$ (resp. $R$) denote the proportion of positive responses in the left (resp. right) node, and let $w$ denote the proportion of samples in the left node, i.e.,

$$ \begin{aligned} L &:= \frac{\sum_{k=1}^N I(S_k \leq s, Y_k=1)}{\sum_{k=1}^N I(S_k \leq s)} \\ R &:= \frac{\sum_{k=1}^N I(S_k > s, Y_k=1)}{\sum_{k=1}^N I(S_k > s)} \\ w &:= \frac{1}{N} \sum_{k=1}^N I(S_k \leq s) \end{aligned} $$

We can now define the entropy and the Gini impurity index (not to be confused with Gini index).

Definition   The weighted entropy $H$ and the weighted Gini impurity index $G$ associated with split $s\in \mathcal{S}$ are defined by

$$ \begin{aligned} G &:= L(1-L)w + R(1-R)(1-w) \\ H &:= -\left[L\log(L) + (1-L)\log(1-L)\right]w - \left[R\log(R) + (1-R)\log(1-R)\right](1-w) \end{aligned} \tag{11} $$

To obtain an efficient algorithm for computing these two metrics, suppose we have only 3 bins. The following table illustrates the computation of the weighted Gini impurity $G$ for each candidate split (the entropy $H$ is computed analogously).

| left_rate | right_rate | impurity |
|---|---|---|
| $L_1 := \frac{e_1}{n_1}$ | $R_1:= \frac{e_2+e_3}{n_2+n_3}$ | $L_1(1-L_1)w_1 + R_1(1-R_1)(w_2+w_3)$ |
| $L_2 := \frac{e_1+e_2}{n_1+n_2}$ | $R_2 := \frac{e_3}{n_3}$ | $L_2(1-L_2)(w_1+w_2) + R_2(1-R_2)w_3$ |
| $L_3 := \frac{e_1+e_2+e_3}{n_1+n_2+n_3}$ | $R_3 := 0$ | -- |

In this table, the $i$th row corresponds to placing the split at the right endpoint of the bin $(q_{i-1}, q_i]$. For the $i$th bin, $w_i$ denotes the proportion of examples, $n_i$ the total number of responses, and $e_i$ the number of positive responses. From this table we can easily vectorize our code.
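Here is a vectorized sketch of the impurity column, assuming per-bin totals `n` and positive counts `e` as in the table; cumulative sums give the left-node quantities and their complements give the right node (the entropy $H$ in (11) can be vectorized the same way):

import numpy as np

def weighted_gini_impurities(n, e):
    """Weighted Gini impurity from (11) for a split at each interior bin boundary."""
    n, e = np.asarray(n, dtype=float), np.asarray(e, dtype=float)
    N = n.sum()
    n_left, e_left = np.cumsum(n)[:-1], np.cumsum(e)[:-1]   # drop the last split (empty right node)
    n_right, e_right = N - n_left, e.sum() - e_left
    L, R = e_left / n_left, e_right / n_right               # positive-response rates per node
    w = n_left / N                                          # proportion of samples on the left
    return L * (1 - L) * w + R * (1 - R) * (1 - w)

# Example with 3 bins: totals n_i and positive counts e_i
print(weighted_gini_impurities(n=[30, 40, 30], e=[20, 15, 3]))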

Binning

Finally, let's briefly talk about binning. Start with $r$ bins with boundary points $q_0,\dots,q_r$ corresponding to bins $B_1,\dots,B_r$, where $B_i$ contains all values of $S$ that belong to the interval $(q_{i-1}, q_i]$. In each bin $B_i$, define $Z_i = \sum_{s_k\in B_i} Y_k$ to be the sum of the target variable over the $i$th bin. Since $Y$ is binary, we have

$$ Z_i \sim \text{Binomial}(n_i, p_i), $$

where $n_i$ is the number of observations in $B_i$ and $p_i$ is the (unknown) probability of a positive response in that bin.

From data, we compute the maximum likelihood estimator as the sample proportion:

$$ \hat{p}_i = \frac{1}{n_i}\sum_{s_k\in B_i} Y_k. $$

For any two bins $B_i$ and $B_j$, we wish to compare their sample proportions. A widely used method is to look at the distribution of the log-odds ratio

$$ \begin{aligned} \log(\widehat{\text{OR}}) &= \log \left(\frac{\hat{p}_i}{1-\hat{p}_i} \right) - \log \left(\frac{\hat{p}_j}{1-\hat{p}_j} \right) \\ &= \log\, \left(\frac{\frac{\hat{p}_i}{1-\hat{p}_i}}{\frac{\hat{p}_j}{1-\hat{p}_j}} \right) = \log \left(\frac{\hat{p}_i(1-\hat{p}_j)}{\hat{p}_j(1-\hat{p}_i)} \right). \end{aligned} $$

Why take the log? The nice property of the log-odds ratio is that it is asymptotically normal. Let's simulate 10000 binomial experiments with $n_1=50$, $n_2=60$, $p_1=0.7$, and $p_2=0.3$ and plot the resulting odds ratios and log-odds ratios.

In [50]:
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')

# Bin sizes and true positive-response probabilities
n1, p1 = 50, 0.7
n2, p2 = 60, 0.3

# Simulate 10000 sample proportions for each bin
np.random.seed(7)
p1_hat = np.random.binomial(n1, p1, 10000) / n1
p2_hat = np.random.binomial(n2, p2, 10000) / n2
oddsratio = p1_hat * (1 - p2_hat) / (p2_hat * (1 - p1_hat))
logoddsratio = np.log(oddsratio)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
ax1.hist(oddsratio, bins=20, edgecolor='black')
ax1.set_title('Odds Ratio')
ax2.hist(logoddsratio, bins=20, edgecolor='black')
ax2.set_title('Log-odds Ratio')
plt.show()

We see that the odds ratio has a skewed distribution and, by taking the log, the log-odds ratio becomes approximately bell-shaped. Indeed, the Delta method can be used to show asymptotic normality and to estimate the variance of the log-odds ratio test statistic. Recall that if $Y_n$ is a sequence of random variables satisfying $\sqrt{n}(Y_n-\mu) \overset{d}{\to} N(0, \sigma^2)$, then for any differentiable function $g$ such that $g'(\mu)\neq 0$,

$$ \sqrt{n}\left(g(Y_n) - g(\mu)\right) \overset{d}{\to}N(0, \sigma^2 (g'(\mu))^2). \tag{12} $$

Here the sequence of random variables is the sample proportion $\hat{p}$, which by the Central Limit Theorem satisfies $\sqrt{n}(\hat{p}-p) \overset{d}{\to} N(0, p(1-p))$, so that $\mu=p$ and $\sigma^2=p(1-p)$. The function $g$ is the logit function, whose derivative is given by

$$ \begin{aligned} g'(p) &= \frac{d}{dp}\log\left(\frac{p}{1-p}\right) \\ &= \frac{1-p}{p}\frac{(1-p)+p}{(1-p)^2} \\ &= \frac{1}{p(1-p)}. \end{aligned} $$

Hence by the Delta Method (12), we have

$$ \begin{aligned} \text{Var} \left[ \log(\widehat{\text{OR}})\right]&= \text{Var}\left[\log\left(\frac{\hat{p}_i}{1-\hat{p}_i}\right) \right] + \text{Var}\left[\log\left(\frac{\hat{p}_j}{1-\hat{p}_j}\right)\right] \\ &= \left(\frac{1}{\hat{p}_i(1-\hat{p}_i)}\right)^2 \frac{\hat{p}_i(1-\hat{p}_i)}{n_i} + \left(\frac{1}{\hat{p}_j(1-\hat{p}_j)}\right)^2 \frac{\hat{p}_j(1-\hat{p}_j)}{n_j} \\ &= \frac{1}{n_i\hat{p}_i(1-\hat{p}_i)} + \frac{1}{n_j\hat{p}_j(1-\hat{p}_j)} \\ &= \frac{1}{n_i\hat{p}_i} + \frac{1}{n_i(1-\hat{p}_i)} + \frac{1}{n_j\hat{p}_j} + \frac{1}{n_j(1-\hat{p}_j)} \\ &= \frac{1}{e_i} + \frac{1}{n_i-e_i} + \frac{1}{e_j} + \frac{1}{n_j-e_j}, \end{aligned} \tag{13} $$

where $e_i$ and $e_j$ are the numbers of positive responses in bins $B_i$ and $B_j$. We summarize the results below.
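As a quick numerical check of (13), we can compare the empirical variance of simulated log-odds ratios with the Delta-method approximation $\frac{1}{n_1 p_1(1-p_1)} + \frac{1}{n_2 p_2(1-p_2)}$; the sketch below simply reuses the simulation settings from the cell above:

import numpy as np

n1, p1 = 50, 0.7
n2, p2 = 60, 0.3

np.random.seed(7)
p1_hat = np.random.binomial(n1, p1, 10000) / n1
p2_hat = np.random.binomial(n2, p2, 10000) / n2
log_or = np.log(p1_hat * (1 - p2_hat) / (p2_hat * (1 - p1_hat)))

print(np.var(log_or))                                        # empirical variance
print(1 / (n1 * p1 * (1 - p1)) + 1 / (n2 * p2 * (1 - p2)))   # Delta-method approximation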

Confidence Interval for Log-odds Ratio

The 95% confidence interval for the log-odds ratio is

$$ \log\widehat{\text{OR}} - 1.96 \text{SE} < \log \text{OR} < \log\widehat{\text{OR}} + 1.96 \text{SE}, $$

where $\widehat{\text{OR}} = \frac{\hat{p}_i(1-\hat{p}_j)}{\hat{p}_j(1-\hat{p}_i)}$ and

$$ \text{SE} = \sqrt{\frac{1}{e_i} + \frac{1}{n_i-e_i} + \frac{1}{e_j} + \frac{1}{n_j-e_j}}. $$

We start with $r$ bins based on percentiles of the data. We then repeatedly combine adjacent bins $B_i$ and $B_{i+1}$ whenever their 95% confidence interval for the log-odds ratio contains 0.
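Here is a greedy sketch of this merging rule (one of several reasonable implementations; the function name is illustrative). It assumes per-bin totals `n` and positive counts `e`, with every bin containing both classes, and repeatedly merges the first adjacent pair whose 95% confidence interval for the log-odds ratio contains 0:

import numpy as np

def merge_bins(n, e, z=1.96):
    """Merge adjacent bins whose log-odds ratio is not significantly different from 0."""
    n, e = list(map(float, n)), list(map(float, e))
    merged = True
    while merged and len(n) > 1:
        merged = False
        for i in range(len(n) - 1):
            ni, ei, nj, ej = n[i], e[i], n[i + 1], e[i + 1]
            log_or = np.log(ei * (nj - ej) / (ej * (ni - ei)))             # log-odds ratio of adjacent bins
            se = np.sqrt(1 / ei + 1 / (ni - ei) + 1 / ej + 1 / (nj - ej))  # SE from (13)
            if abs(log_or) <= z * se:               # CI for the log-odds ratio contains 0
                n[i], e[i] = ni + nj, ei + ej       # combine B_i and B_{i+1}
                del n[i + 1]
                del e[i + 1]
                merged = True
                break
    return n, e

# Example: the first two bins and the last two bins have similar positive rates
print(merge_bins(n=[100, 100, 100, 100], e=[10, 12, 40, 45]))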