June 21, 2020 $\newcommand{\bs}{\boldsymbol}$ $\newcommand{\argmin}[1]{\underset{\bs{#1}}{\text{arg min}}\,}$ $\newcommand{\argmax}[1]{\underset{\bs{#1}}{\text{arg max}}\,}$ $\newcommand{\tr}{^{\top}}$ $\newcommand{\norm}[1]{\left|\left|\,#1\,\right|\right|}$ $\newcommand{\given}{\,|\,}$ $\newcommand{\st}{\,\big|\,}$ $\newcommand{\E}[1]{\mathbb{E}\left[#1\right]}$ $\newcommand{\P}[1]{\mathbb{P}\left(#1\right)}$ $\newcommand{\abs}[1]{\left|#1\right|}$ $\newcommand{\blue}[1]{{\color{blue} {#1}}}$ $\newcommand{\red}[1]{{\color{red} {#1}}}$ $\newcommand{\orange}[1]{{\color{orange} {#1}}}$ $\newcommand{\pfrac}[2]{\frac{\partial #1}{\partial #2}}$

In this post we would like to test the theory of univariate analysis presented in the previous post. For simplicity, we use the breast cancer dataset. Random noise is added to each attribute to make the task seem less trivial. The goal is to look at the results and check that they make sense. There are in total 30 attributes and 569 examples. The target variable `is_benign` equals 1 if the tumor is benign and 0 otherwise.
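
A minimal sketch of this setup, assuming scikit-learn's built-in copy of the dataset (the noise scale below is an arbitrary illustrative choice, not necessarily the one used for the plots):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

# Load the breast cancer dataset: 569 examples, 30 attributes.
data = load_breast_cancer()
X = pd.DataFrame(data.data, columns=data.feature_names)

# In scikit-learn's encoding, target == 1 already means "benign".
y = pd.Series(data.target, name="is_benign")

# Add random noise to each attribute to make the task less trivial
# (the noise scale here is an arbitrary choice for illustration).
rng = np.random.default_rng(0)
X_noisy = X + rng.normal(scale=0.1 * X.std().values, size=X.shape)
```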

Decision trees are among the most effective and explainable techniques. It turns out that a simple tree with only two levels of splits (depth 2) is a fairly good model. If we let $S$ denote the predicted probability from the tree, the results from the univariate analysis are shown below. We observe that the Receiver Operating Characteristic (ROC) curve is simply a diagonal reflection of the Lorenz Curve (LC), and the following relationship holds between the Gini coefficient and the AUC: $\mathcal{G} = 2\cdot\text{AUC}-1$.
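
A sketch of this step (reusing `X_noisy` and `y` from the snippet above; `max_depth=2` is an assumption consistent with the four terminal leaves mentioned below, and the printed numbers are not the ones in the plots): fit the tree, use its predicted probability as the score $S$, and check that the Gini coefficient, computed independently as Somers' $D$ over benign/malignant pairs, equals $2\cdot\text{AUC}-1$.

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.tree import DecisionTreeClassifier

# Fit a depth-2 tree (at most 4 terminal leaves) and use the predicted
# probability of the benign class as the score S.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X_noisy, y)
S = tree.predict_proba(X_noisy)[:, 1]

auc = roc_auc_score(y, S)

# Gini coefficient computed independently as Somers' D:
# (concordant - discordant) benign/malignant pairs, over all such pairs.
pos, neg = S[y.values == 1], S[y.values == 0]
diff = pos[:, None] - neg[None, :]
gini = (np.sum(diff > 0) - np.sum(diff < 0)) / diff.size

assert np.isclose(gini, 2 * auc - 1)
print(f"AUC = {auc:.3f}, Gini = {gini:.3f}")
```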

We also see that the binning technique discussed in the previous post is working as expected here. There are 4 bins (after combining), corresponding to the 4 terminal leaves of the tree.
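
One way to see this correspondence (a sketch reusing `tree`, `S`, `X_noisy` and `y` from the snippets above): every example that lands in the same terminal leaf receives the same score, so grouping by the leaf index returned by `tree.apply` recovers the bins and their benign rates.

```python
import pandas as pd

# Each terminal leaf of the tree defines one bin of the score S.
summary = (
    pd.DataFrame({"leaf": tree.apply(X_noisy), "score": S, "is_benign": y})
    .groupby("leaf")
    .agg(n=("is_benign", "size"),
         benign_rate=("is_benign", "mean"),
         score=("score", "first"))
    .sort_values("score")
)
print(summary)   # 4 rows: one per terminal leaf / bin
```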

The top nodes of a decision tree are often the most predictive attributes in the "univariate" sense. From this tree, `mean concave points`, `worst radius` and `worst perimeter` are some of the variables that, when split on, produce the biggest reduction in the weighted Gini impurity. If we let $S$ denote one of these attributes, the result gives us a lot of information:

You may be wondering why the AUC is so low. This makes sense since we expect an inverse relationship between `mean concave points` and the target variable `is_benign`. The bottom left graph is the Lorenz curve after we transform the variable (by simply sorting the bins) to force the rates to be monotonically increasing. The top right scatter plot is obtained by plotting the log-odds of the target variable in each bin against the maximum input value in that bin. Let's talk about this transformation.

Suppose we have one predictor $X$ and a binary response variable $Y$. Using **Bayes' theorem in odds form**, the log-odds of the response variable $Y$ is given by

$$ \log \frac{P(Y=1\given X)}{P(Y=0\given X)} = \log \frac{P(Y=1)}{P(Y=0)} + \log \frac{f(X\given Y=1)}{f(X\given Y=0)}. \tag{1} $$The term on the left hand side of the equation is the **posterior log-odds** given $X$. The first term on the right hand side is the **prior log-odds** of the response, which can be calculated from the marginal distribution of $Y$ alone, without looking at $X$. The right-most term is known as the **Weight of Evidence (WOE)** for $X$, which describes the relationship between $X$ and $Y$ and is perfectly correlated with the posterior log-odds. Hence, the higher the WOE, the more certain we are about observing $Y=1$.
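
To make (1) concrete, here is a toy numerical check with a single binary predictor; the counts in the 2×2 table are made up purely for illustration.

```python
import numpy as np

# Toy 2x2 contingency table: rows = X in {0, 1}, columns = Y in {0, 1}.
counts = np.array([[30.0, 10.0],    # X = 0
                   [20.0, 40.0]])   # X = 1

n_y0, n_y1 = counts.sum(axis=0)

x = 1  # evaluate (1) at X = 1
posterior_log_odds = np.log(counts[x, 1] / counts[x, 0])
prior_log_odds = np.log(n_y1 / n_y0)
woe = np.log((counts[x, 1] / n_y1) / (counts[x, 0] / n_y0))

# Posterior log-odds = prior log-odds + WOE, as in (1).
assert np.isclose(posterior_log_odds, prior_log_odds + woe)
```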

In logistic regression, we estimate the posterior probability using the sigmoid function

$$ P(Y=1\given X) = \frac{1}{1+e^{-(\theta_0 + \theta_1 X)}}. $$Inverting the sigmoid (i.e., taking the logit), we can express the log-odds as a linear function of $X$:

$$ \log \frac{P(Y=1\given X)}{P(Y=0\given X)} = \theta_0 + \theta_1X. $$Comparing with (1), $\theta_0$ plays the role of the prior log-odds, and $\theta_1 X$ captures the WOE of $X$. So logistic regression is simply a linear estimation of the Weight of Evidence!
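
A small sketch of this reading with scikit-learn (reusing `X_noisy` and `y` from the earlier snippets; picking `mean concave points` as the single predictor is just for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Univariate logistic regression on one attribute.
x = X_noisy[["mean concave points"]]
logreg = LogisticRegression().fit(x, y)

theta0 = logreg.intercept_[0]
theta1 = logreg.coef_[0, 0]

# The fitted posterior log-odds is exactly the linear function theta0 + theta1 * x.
p = logreg.predict_proba(x)[:, 1]
log_odds = np.log(p / (1 - p))
assert np.allclose(log_odds, theta0 + theta1 * x.values.ravel())
```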

Aside from discriminative models like logistic regression, we can also view (1) in terms of generative models such as the Naive Bayes classifier. In the generative framework, we estimate $f(X\given Y=1)$ using maximum likelihood, which often has a closed-form solution and takes less time to compute than iterative methods such as gradient descent. For instance, if the conditional distributions of $X$ given $Y=0$ and $Y=1$ are assumed to be Gaussian with a shared variance, then given data $x_1,...,x_N$ and $y_1,...,y_N$, the closed-form expression for the WOE is also linear in $X$ and is given by

$$ \begin{aligned} \log \frac{f(X\given Y=1)}{f(X\given Y=0)} &= \frac{1}{2\hat{\sigma}^2} \left(2(\hat{\mu}_1-\hat{\mu}_0)X -(\hat{\mu}_1^2 - \hat{\mu}_0^2)\right), \end{aligned} $$where $\hat{\mu}_0$, $\hat{\mu}_1$ and $\hat{\sigma}^2$ are the maximum likelihood estimates under the joint distribution $p(x, y)$, given by

$$ \begin{aligned} \hat{\mu}_0 &= \frac{\sum_{i=1}^N I(y_i=0)x_i}{\sum_{i=1}^N I(y_i=0)} \\ \hat{\mu}_1 &= \frac{\sum_{i=1}^N I(y_i=1)x_i}{\sum_{i=1}^N I(y_i=1)} \\ \hat{\sigma}^2 &= \frac{1}{N} \sum_{i=1}^N (x_i-\hat{\mu}_{y_i})^2. \end{aligned} $$In the credit scoring world, it is common practice to transform the variable $X$ to its WOE. Let $B_1,...,B_r$ denote the bins of $X$, and let $B(X)$ denote the bin that contains $X$. We define the **WOE transformation** as

$$ Z = \log \frac{P(Y=1\given X)}{P(Y=0\given X)} \approx \log \frac{P(Y=1\given X\in B(X))}{P(Y=0\given X\in B(X))}, $$where the final expression can be easily estimated from the target counts in each bin. From (1), we know that (theoretically) $Z$ is perfectly correlated with the WOE; thus making logistic regression on $Z$ (instead of $X$) more appropriate. Of course, the obvious downside of this transformation is that we lose information by mapping a continuous variable to a finite set of values. Another drawback is that it is prone to overfitting, since it "divulges" information about the target variable into the input attributes. On the other hand, logistic regression is known to be unstable when the class separation is large; this transformation not only stabilizes the model but also automatically takes care of missing values.
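
A rough sketch of this transformation (reusing `X_noisy` and `y` from the earlier snippets; the quantile binning, the number of bins, and the `+0.5` smoothing are all illustrative assumptions rather than the binning scheme used for the plots):

```python
import numpy as np
import pandas as pd

def woe_transform(x, y, n_bins=4):
    """Replace each value of x by the empirical log-odds of y in its bin.

    Illustrative helper: quantile bins, with 0.5 added to each count to
    avoid taking the log of zero in pure bins.
    """
    df = pd.DataFrame({"bin": pd.qcut(x, q=n_bins, duplicates="drop"), "y": y})
    grouped = df.groupby("bin", observed=True)["y"]
    pos = grouped.transform("sum") + 0.5                          # count of y = 1 in the bin
    neg = grouped.transform("count") - grouped.transform("sum") + 0.5
    return np.log(pos / neg)                                      # Z: per-bin posterior log-odds

Z = woe_transform(X_noisy["mean concave points"], y)
print(Z.nunique(), "distinct values")                             # one value per bin
```

A logistic regression can then be fit on $Z$ instead of the raw attribute; missing values are typically given their own bin, which is what allows the transformation to take care of them automatically.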

Finally, let's show another example using one of the weaker attributes from the model.

Here we see that the univariate predictive power of this attribute is much weaker; however, the Gini coefficient fails to capture the large variability in the **rate** that we see in the Gains chart. It turns out that this variability is captured by the **transformed entropy** metric. This makes sense since entropy is defined using a binary tree split regardless of the weight in each node. We conclude with a pairplot of the different metrics.
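
For reference, such a pairplot can be produced with seaborn, assuming the per-attribute metrics have already been collected into a DataFrame (the name `metric_df` is a placeholder; its columns are the metrics discussed below):

```python
import seaborn as sns
import matplotlib.pyplot as plt

# metric_df: one row per attribute, one column per metric (placeholder name).
cols = ["transformed_gini", "transformed_ks", "transformed_iv",
        "entropy", "transformed_entropy"]
sns.pairplot(metric_df[cols])
plt.show()
```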

We see very strong correlation among `transformed_gini`, `transformed_ks`, `transformed_iv`, and `entropy`. The final metric `transformed_entropy` shows a different pattern because it is more concerned with the variability in the bins!