Univariate Analysis with Binary Response II

June 21, 2020 $\newcommand{\bs}{\boldsymbol}$ $\newcommand{\argmin}[1]{\underset{\bs{#1}}{\text{arg min}}\,}$ $\newcommand{\argmax}[1]{\underset{\bs{#1}}{\text{arg max}}\,}$ $\newcommand{\tr}{^{\top}}$ $\newcommand{\norm}[1]{\left|\left|\,#1\,\right|\right|}$ $\newcommand{\given}{\,|\,}$ $\newcommand{\st}{\,\big|\,}$ $\newcommand{\E}[1]{\mathbb{E}\left[#1\right]}$ $\newcommand{\P}[1]{\mathbb{P}\left(#1\right)}$ $\newcommand{\abs}[1]{\left|#1\right|}$ $\newcommand{\blue}[1]{{\color{blue} {#1}}}$ $\newcommand{\red}[1]{{\color{red} {#1}}}$ $\newcommand{\orange}[1]{{\color{orange} {#1}}}$ $\newcommand{\pfrac}[2]{\frac{\partial #1}{\partial #2}}$

In this post we test the theory of univariate analysis presented in the previous post. For simplicity, we use the breast cancer dataset, with random noise added to each attribute to make the task less trivial. The goal is to look at the results and check that they make sense. There are in total 30 attributes and 569 examples. The target variable is_benign equals 1 if the tumor is benign and 0 otherwise.

The decision tree is one of the most effective and explainable techniques. It turns out that a simple tree with only 2 levels of splits is a fairly good model. If we let $S$ denote the predicted probability from the tree, the results from the univariate analysis are shown below. We observe that the Receiver Operating Characteristic (ROC) curve is simply a diagonal reflection of the Lorenz Curve (LC), and the following relationship holds between the Gini coefficient and the AUC: $\mathcal{G} = 2\cdot\text{AUC}-1$.
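The setup above can be sketched in a few lines of scikit-learn. The noise scale and random seeds are assumptions (the post does not specify them), so the exact AUC will differ from Figure 1, but the identity $\mathcal{G} = 2\cdot\text{AUC}-1$ holds regardless:

```python
# A minimal sketch of the experiment: breast cancer data with added noise,
# a depth-2 tree, and the Gini/AUC relationship. Noise scale and seeds are
# arbitrary assumptions.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

X, y = load_breast_cancer(return_X_y=True)   # y = 1 means benign
rng = np.random.default_rng(0)
X_noisy = X + rng.normal(scale=0.1 * X.std(axis=0), size=X.shape)

# A depth-2 tree has up to 4 terminal leaves (the 4 bins in Figure 2).
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X_noisy, y)
S = tree.predict_proba(X_noisy)[:, 1]        # predicted P(benign)

auc = roc_auc_score(y, S)
gini = 2 * auc - 1                           # Gini coefficient from AUC
print(f"AUC = {auc:.3f}, Gini = {gini:.3f}")
```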

Figure 1 - Tree performance

We also see that the binning technique discussed in the previous post is working as expected here. There are 4 bins (after combining), corresponding to the 4 terminal leaves of the tree.

Figure 2 - Tree plot

The top nodes of the tree are often the most predictive attributes in the "univariate" sense. From this tree, mean concave points, worst radius and worst perimeter are some of the variables that, when split on, produce the biggest reduction in the weighted Gini impurity. If we let $S$ denote one of these attributes, the result gives us a lot of information:

Figure 3 - Univariate analysis for "mean concave point"

You may be wondering why the AUC is so low. This makes sense, since we expect an inverse relationship between mean concave points and the target variable is_benign. The bottom-left graph is the Lorenz curve after we transform the variable (by simply sorting the bins) to force the rates to be monotonically increasing. The top-right scatter plot is obtained by plotting the log-odds of the target variable in each bin against the maximum input value in that bin. Let's talk about this transformation.

The Weight of Evidence

Suppose we have one predictor $X$, and a binary response variable $Y$. Using the Bayes' theorem in odds form, the log-odds of the response variable $Y$ is given by

$$ \log \frac{P(Y=1 \given X)}{P(Y=0\given X)} = \log \frac{P(Y=1)}{P(Y=0)} + \log \frac{f(X\given Y=1)}{f(X\given Y=0)}. \tag{1} $$

The term on the left-hand side of the equation is the posterior log-odds given $X$. The first term on the right-hand side is the prior log-odds of the response (which can be calculated without seeing $X$). The right-most term is known as the Weight of Evidence (WOE) for $X$; it describes the relationship between $X$ and $Y$ and is perfectly correlated with the posterior log-odds. Hence, the higher the WOE, the more certain we are of observing $Y=1$.
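Identity (1) is easy to verify numerically for a discrete $X$. The joint probabilities below are made up purely for illustration:

```python
# Numeric check of (1): posterior log-odds = prior log-odds + WOE.
# The joint distribution P(X=x, Y=y) is a toy example.
import math

joint = {(0, 0): 0.30, (0, 1): 0.10,
         (1, 0): 0.15, (1, 1): 0.45}

p_y1 = sum(p for (x, y), p in joint.items() if y == 1)   # P(Y=1)
p_y0 = 1 - p_y1

x = 1
# P(X=x) cancels in the posterior odds, leaving a ratio of joint probabilities.
post = math.log(joint[(x, 1)] / joint[(x, 0)])           # posterior log-odds
prior = math.log(p_y1 / p_y0)                            # prior log-odds
woe = math.log((joint[(x, 1)] / p_y1) / (joint[(x, 0)] / p_y0))  # WOE

print(post, prior + woe)   # the two quantities agree
```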

In logistic regression, we estimate the posterior probability using the sigmoid function

$$ P(Y=1\given X) = \frac{1}{1+e^{-(\theta_0 + \theta_1 X)}}. $$

Taking the inverse, we can express the log-odds as a linear function of $X$:

$$ \log \frac{P(Y=1\given X)}{P(Y=0\given X)} = \theta_0 + \theta_1X. $$

Here $\theta_0$ is exactly the prior log-odds in (1), and $\theta_1 X$ captures the WOE of $X$. So logistic regression is simply a linear estimation of the Weight of Evidence!
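We can confirm the linear log-odds form directly: applying the logit to the fitted probabilities recovers $\theta_0 + \theta_1 X$ exactly (up to floating point). The synthetic 1-D data below is an assumption for illustration:

```python
# Sketch: the logit of a fitted logistic regression's probabilities is
# exactly the linear function theta_0 + theta_1 * x. Data is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(2, 1, 200)])
y = np.repeat([0, 1], 200)

clf = LogisticRegression().fit(x.reshape(-1, 1), y)
p = clf.predict_proba(x.reshape(-1, 1))[:, 1]

log_odds = np.log(p / (1 - p))                    # invert the sigmoid
linear = clf.intercept_[0] + clf.coef_[0, 0] * x  # theta_0 + theta_1 * x
print(np.allclose(log_odds, linear))
```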

Aside from discriminative models like logistic regression, we can view (1) in terms of generative models such as the Naive Bayes classifier. In the generative framework, we usually estimate $f(X\given Y)$ by maximum likelihood, which often has a closed-form solution and takes less time to train than iterative methods such as gradient descent. For instance, if the conditional distributions of $X$ given $Y=0$ and $Y=1$ are assumed to be Gaussian with the same variance, then given data $x_1,\ldots,x_N$ and $y_1,\ldots,y_N$, the closed-form WOE is also linear in $X$ and is given by

$$ \begin{aligned} \log \frac{f(X\given Y=1)}{f(X\given Y=0)} &= \frac{1}{2\hat{\sigma}^2} \left(2(\hat{\mu}_1-\hat{\mu}_0)X -(\hat{\mu}_1^2 - \hat{\mu}_0^2)\right), \end{aligned} $$

where $\hat{\mu}_0, \hat{\mu}_1$ and $\hat{\sigma}^2$ are the maximum likelihood estimates under the joint distribution $p(x, y)$, given by

$$ \begin{aligned} \hat{\mu}_0 &= \frac{\sum_{i=1}^N I(y_i=0)x_i}{\sum_{i=1}^N I(y_i=0)} \\ \hat{\mu}_1 &= \frac{\sum_{i=1}^N I(y_i=1)x_i}{\sum_{i=1}^N I(y_i=1)} \\ \hat{\sigma}^2 &= \frac{1}{N} \sum_{i=1}^N (x_i-\hat{\mu}_{y_i})^2. \end{aligned} $$
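The closed form above can be checked against a direct log-density ratio. The synthetic Gaussian data and the use of scipy.stats.norm are assumptions for illustration:

```python
# Verify the closed-form Gaussian WOE against the log-density ratio
# log f(x|Y=1) - log f(x|Y=0), using synthetic shared-variance data.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
x = np.concatenate([rng.normal(0.0, 1.0, 500), rng.normal(1.5, 1.0, 500)])
y = np.repeat([0, 1], 500)

mu0, mu1 = x[y == 0].mean(), x[y == 1].mean()            # class means (MLE)
sigma2 = np.mean((x - np.where(y == 1, mu1, mu0)) ** 2)  # pooled variance (MLE)

# Closed form: (2(mu1 - mu0) x - (mu1^2 - mu0^2)) / (2 sigma^2)
woe_closed = (2 * (mu1 - mu0) * x - (mu1**2 - mu0**2)) / (2 * sigma2)
# Direct ratio of Gaussian log-densities (normalizing constants cancel)
woe_direct = (norm.logpdf(x, mu1, np.sqrt(sigma2))
              - norm.logpdf(x, mu0, np.sqrt(sigma2)))
print(np.allclose(woe_closed, woe_direct))
```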

In the credit scoring world, it is common practice to transform the variable $X$ to its WOE. Let $B_1,...,B_r$ denote the bins of $X$, and let $B(X)$ denote the bin that contains $X$. We define the WOE transformation as

$$ Z = \log \frac{f(X \given Y=1)}{f(X\given Y=0)} \approx \log \frac{P(X\in B(X)\given Y=1)}{P(X\in B(X) \given Y=0)} , $$

where the final expression can be easily estimated. From (1), we know that (theoretically) $Z$ is perfectly correlated with the posterior log-odds, making logistic regression on $Z$ (instead of $X$) more appropriate. Of course, the obvious downside of this transformation is that we lose information by mapping a continuous variable to a finite set of values. Another con is that it is prone to overfitting, since it "divulges" the target variable's information into the input attributes. However, logistic regression is known to be unstable when the classes are (nearly) perfectly separable. This transformation not only stabilizes the model but also automatically takes care of missing values.
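The binned transform is short to implement. Equal-frequency bins, the helper name woe_transform, and the synthetic data are all assumptions; in practice the bin edges and per-bin WOE values would be estimated on a training set and then applied to new data:

```python
# A sketch of the binned WOE transform Z, assuming equal-frequency bins.
import numpy as np

def woe_transform(x, y, n_bins=4):
    """Map each value of x to log P(B(x) | Y=1) - log P(B(x) | Y=0)."""
    edges = np.quantile(x, np.linspace(0, 1, n_bins + 1))
    bins = np.digitize(x, edges[1:-1])      # bin index 0..n_bins-1 per example
    woe = np.empty(n_bins)
    for b in range(n_bins):
        in_bin = bins == b
        p1 = in_bin[y == 1].mean()          # P(X in bin | Y=1)
        p0 = in_bin[y == 0].mean()          # P(X in bin | Y=0)
        woe[b] = np.log(p1 / p0)            # per-bin WOE (assumes p0, p1 > 0)
    return woe[bins]

rng = np.random.default_rng(3)
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(1, 1, 300)])
y = np.repeat([0, 1], 300)
z = woe_transform(x, y)
print(np.unique(z))   # one WOE value per bin
```

Note that a bin containing only one class would give an infinite WOE; production implementations typically smooth the counts or merge such bins.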

Final Remarks

Finally let's show another example using one of the weaker attributes from the model.

Figure 4 - Univariate analysis for "smoothness error"

Here we see that the univariate predictive power of this attribute is much weaker; however, the Gini coefficient fails to capture the large variability in rate shown in the Gains chart. It turns out that this variability is captured by the metric transformed_entropy. This makes sense, since entropy is defined via a binary tree split regardless of the weight in each node. We conclude with a pairplot of the different metrics.

Figure 5 - A comparison of univariate metrics

We see very strong correlations among transformed_gini, transformed_ks, transformed_iv, and entropy. The final metric transformed_entropy shows a different pattern because it is more concerned with the variability in the bins!