Univariate Analysis III

July 1, 2020 $\newcommand{\bs}{\boldsymbol}$ $\newcommand{\argmin}[1]{\underset{\bs{#1}}{\text{arg min}}\,}$ $\newcommand{\argmax}[1]{\underset{\bs{#1}}{\text{arg max}}\,}$ $\newcommand{\tr}{^{\top}}$ $\newcommand{\norm}[1]{\left|\left|\,#1\,\right|\right|}$ $\newcommand{\given}{\,|\,}$ $\newcommand{\st}{\,\big|\,}$ $\newcommand{\E}[1]{\mathbb{E}\left[#1\right]}$ $\newcommand{\P}[1]{\mathbb{P}\left(#1\right)}$ $\newcommand{\abs}[1]{\left|#1\right|}$ $\newcommand{\blue}[1]{{\color{blue} {#1}}}$ $\newcommand{\red}[1]{{\color{red} {#1}}}$ $\newcommand{\orange}[1]{{\color{orange} {#1}}}$ $\newcommand{\pfrac}[2]{\frac{\partial #1}{\partial #2}}$

In the previous posts, we assumed that the target variable $Y$ follows a Bernoulli distribution with success probability $p$, and that $g(p)$ is linearly related to the input variable $S$, where $g$ is the logit function. In this post we would like to generalize this beautiful theory of univariate analysis. Much of the content here is my own invention, but I believe it is quite reasonable and provides useful insight.

Bernoulli Bifurcation

Suppose $Y$ follows a distribution from the exponential family. Let $S$ be another variable associated with $Y$; for instance, it could be a vector of predictions from a model, or the raw values of an attribute. Let $g$ denote the canonical link function for $Y$, and assume that $S$ is linearly related to $g(E(Y))$. Given data $y_1,...,y_N$, let $\hat{E}$ denote the MLE of $E(Y)$. For instance, if $Y$ is Bernoulli, $\hat{E}$ is the sample proportion $\hat{p}$; if $Y$ is Poisson, $\hat{E}$ is the sample mean $\hat{\lambda}$.

Analogous to the Bernoulli case, given data $s_1,...,s_N$ and target $y_1,...,y_N$, we define the empirical CDF of the "positive" response by

$$ F_1(s) = \frac{1}{n} \sum_{k=1}^N I(S_k\leq s, Y_k > \hat{E}), $$

where $Y_k$ is "positive" if it exceeds the estimate $\hat{E}$, $n = \sum_{k=1}^N I(Y_k > \hat{E})$ is the number of positive responses, and $I(\cdot)$ is the indicator function. Similarly we have

$$ F_0(s) = \frac{1}{m} \sum_{k=1}^N I(S_k\leq s, Y_k \leq \hat{E}), $$

where $m = \sum_{k=1}^N I(Y_k\leq \hat{E})$ is the number of "negative" responses. From here, we can define all the metrics (KS, Gini, AUC, information value, impurity and entropy) exactly as before. Needless to say, this provides a quick and dirty way to perform the WOE transformation for any type of univariate response variable $Y$!

In general, this method can quickly pick out the attributes that are most important in the univariate sense. It can be further improved by moving the threshold from $\hat{E}$ to various quantiles of $Y$; the final generalized metric can then be a weighted average of the per-quantile values, as in the sketch below. The choice of quantiles should depend on the goal of the model.
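As a concrete illustration, here is a minimal Python sketch of the bifurcation procedure. The function name, the NumPy implementation, and the quantile option are my own; any of the binary metrics from the earlier posts can be layered on top of the returned CDFs.

```python
import numpy as np

def bifurcate(s, y, q=None):
    """Split a general response y into "positive"/"negative" and return
    the empirical CDFs F0, F1 evaluated at the sorted scores.

    s : 1-D array of scores (model predictions or raw attribute values).
    y : 1-D array of responses from an exponential-family distribution.
    q : optional quantile in (0, 1); if None, threshold at the sample
        mean, which is the MLE of E(Y) in the Bernoulli/Poisson/normal cases.
    """
    s, y = np.asarray(s, float), np.asarray(y, float)
    thresh = y.mean() if q is None else np.quantile(y, q)
    order = np.argsort(s)
    pos = (y > thresh)[order]        # I(Y_k > E-hat), sorted by score
    neg = ~pos                       # I(Y_k <= E-hat)
    F1 = np.cumsum(pos) / pos.sum()  # "positive" empirical CDF
    F0 = np.cumsum(neg) / neg.sum()  # "negative" empirical CDF
    return F0, F1

# For example, the KS statistic at the median threshold:
# F0, F1 = bifurcate(s, y, q=0.5)
# ks = np.max(np.abs(F1 - F0))
```

A weighted average of such KS values over several quantiles then gives the generalized metric described above.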

Categorical Distribution

If $Y$ follows the categorical distribution, then it takes values in $\{c_1,...,c_K\}$ with probabilities $\{\pi_1,...,\pi_K\}$. The probability mass function for $Y$ can be written as

$$ P(Y=c) = \prod_{i=1}^K \pi_i^{I(c=c_i)}. $$

It is nonsensical to talk about the mean and variance of this distribution, since the values that $Y$ takes are categorical. However, we can easily work around this using a one-vs-all transformation.

For any integer $n$ between $1$ and $K$, we define

$$ Y_n = \begin{cases} 1 & \text{ if } Y = c_n \\ 0 & \text{ otherwise.} \end{cases} $$

It follows that $Y_n \sim \text{Bernoulli}(\pi_n)$ for $n=1,...,K$. The final univariate metrics can then be the average of these $K$ individual metrics.
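A short sketch of the one-vs-all averaging, assuming some binary metric routine (a placeholder `binary_metric` standing in for the KS or Gini computations from the earlier posts) is already available:

```python
import numpy as np

def one_vs_all_metric(s, y, binary_metric):
    """Average a binary univariate metric over the K one-vs-all splits.

    s             : 1-D array of scores.
    y             : 1-D array of categorical labels c_1, ..., c_K.
    binary_metric : callable(s, y01) -> float computing KS, Gini, etc.
                    on a 0/1 target (assumed given, not defined here).
    """
    classes = np.unique(y)
    # Y_n = I(Y = c_n) is Bernoulli(pi_n); score each split separately.
    return np.mean([binary_metric(s, (y == c).astype(int)) for c in classes])
```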

Poisson Distribution

In practice, we often need to model a discrete count. For example, $Y$ could be the number of car accidents in a given period of time, or the number of calls a service location receives. In these cases, a Poisson distribution is appropriate. Recall that if $Y$ is Poisson with arrival intensity $\lambda$, it takes values in $\{0,1,2,...\}$, and

$$ P(Y=y) = \frac{\lambda^y e^{-\lambda}}{y!}. $$

The sample mean of the Poisson distribution in bin $B_i$ is

$$ \hat{\lambda}_i = \frac{1}{n_i} \sum_{s_k \in B_i} Y_k. $$

Thus, by the delta method (and assuming the two bins are independent), the log-rate ratio is asymptotically normal with variance equal to

$$ \begin{aligned} \text{Var} \log \left(\frac{\hat{\lambda}_i}{\hat{\lambda}_j} \right) &= \text{Var} \log \hat{\lambda}_i + \text{Var} \log \hat{\lambda}_j\\ &\approx \frac{1}{n_i^2} \frac{n_i\hat{\lambda}_i}{\hat{\lambda}_i^2} + \frac{1}{n_j^2} \frac{n_j\hat{\lambda}_j}{\hat{\lambda}_j^2} \\ &= \frac{1}{n_i\hat{\lambda}_i} + \frac{1}{n_j\hat{\lambda}_j} \\ &= \frac{1}{\sum_{s_k \in B_i} Y_k} + \frac{1}{\sum_{s_k\in B_j} Y_k}. \end{aligned} $$
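In code, the final expression is just two reciprocal bin totals. A minimal sketch (the function name is mine):

```python
import numpy as np

def poisson_log_ratio_var(y_i, y_j):
    """Delta-method variance of log(lambda_hat_i / lambda_hat_j).

    y_i, y_j : arrays of Poisson counts whose scores fall in bins
               B_i and B_j, assumed independent.
    """
    # Var log(lhat_i / lhat_j) ~= 1 / sum(Y in B_i) + 1 / sum(Y in B_j)
    return 1.0 / np.sum(y_i) + 1.0 / np.sum(y_j)
```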

Normal Distribution

In the ordinary least squares problem, the target variable $Y$ is assumed to be normally distributed with mean $\mu$ and variance $\sigma^2$. Since the canonical link function is the identity, we simply look at the distribution of the difference of two sample means. The sample mean and variance in bin $B_i$ can be computed as

$$ \begin{aligned} \hat{\mu}_i &= \frac{1}{n_i} \sum_{s_k \in B_i} Y_k \\ \hat{\sigma}_i^2 &= \frac{1}{n_i} \sum_{s_k \in B_i} (Y_k - \hat{\mu}_i)^2. \end{aligned} $$

Given two independent bins $B_i$ and $B_j$, the variance of the difference of the sample means can be computed as

$$ \begin{aligned} \text{Var} \left(\hat{\mu}_i - \hat{\mu}_j \right) &= \text{Var} \hat{\mu}_i + \text{Var} \hat{\mu}_j\\ &= \frac{1}{n_i^2} n_i\hat{\sigma}_i^2 + \frac{1}{n_j^2} n_j\hat{\sigma}_j^2\\ &= \frac{1}{n_i^2} \sum_{s_k\in B_i} (Y_k-\hat{\mu}_i)^2 + \frac{1}{n_j^2} \sum_{s_k\in B_j} (Y_k-\hat{\mu}_j)^2 \\ &= \frac{\sum Y_k^2 - 2\hat{\mu}_i\sum Y_k + n_i\hat{\mu}_i^2}{n_i^2} + \frac{\sum Y_k^2 - 2\hat{\mu}_j\sum Y_k + n_j\hat{\mu}_j^2}{n_j^2}, \end{aligned} $$

where the last expression is written to facilitate code vectorization.
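For instance, a vectorized implementation might look like the following sketch, under the same independence assumption (the function names are mine):

```python
import numpy as np

def normal_mean_diff_var(y_i, y_j):
    """Variance of mu_hat_i - mu_hat_j via the expanded identity above.

    y_i, y_j : arrays of normal responses in bins B_i and B_j.
    """
    def bin_term(y):
        y = np.asarray(y, float)
        n, s1, s2 = len(y), y.sum(), (y ** 2).sum()
        mu = s1 / n
        # sum (Y_k - mu)^2 = sum Y_k^2 - 2*mu*sum Y_k + n*mu^2
        return (s2 - 2 * mu * s1 + n * mu ** 2) / n ** 2
    return bin_term(y_i) + bin_term(y_j)
```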

Gamma Distribution

The Gamma distribution is another family of distributions widely used in economics and insurance risk modeling. It is well known that the sum of i.i.d. exponential random variables follows a Gamma distribution; hence the Gamma distribution is appropriate for modeling the total waiting time until a given number of events occur.

If $X\sim \text{Gamma}(\alpha, \beta)$, where $\alpha>0$ is the shape parameter, and $\beta>0$ is the rate parameter, then we have

$$ f(x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x} I(x>0), $$

where $\Gamma$ is the Gamma function, which has the amazing property that $\Gamma(\alpha+1) = \alpha\Gamma(\alpha)$ for all $\alpha>0$. The expected value and the variance of the Gamma distribution are given by $\alpha/\beta$ and $\alpha/\beta^2$, respectively.

Since maximum likelihood estimation does not yield closed-form solutions, we can use a quick and dirty alternative, the method of moments (MOM), which yields the following system

$$ \begin{aligned} \frac{1}{n} \sum Y_i &= \alpha / \beta \\ \frac{1}{n} \sum Y_i^2 &= \alpha / \beta^2 + \alpha^2 / \beta^2. \end{aligned} $$

From this we solve for the variance $\alpha/\beta^2$ and obtain

$$ \text{Var}(X) = \alpha/\beta^2 \approx \frac{1}{n} \sum Y_i^2 - \left(\frac{1}{n}\sum Y_i\right)^2. $$

Let $\hat{\mu}_i$ and $\hat{\mu}_j$ be the sample means of the Gamma response in bins $B_i$ and $B_j$. We use the Delta method to obtain an empirical estimator for the variance of the log-mean ratio:

$$ \begin{aligned} \text{Var}\log\left(\frac{\hat{\mu}_i}{\hat{\mu}_j} \right) &= \text{Var}\log \hat{\mu}_i + \text{Var}\log \hat{\mu}_j\\ &\approx \frac{1}{n_i^2} \frac{n_i \left(\frac{1}{n_i} \sum Y_k^2 - \left(\frac{1}{n_i} \sum Y_k \right)^2\right)}{\left(\frac{1}{n_i} \sum Y_k\right)^2} + \frac{1}{n_j^2} \frac{n_j \left(\frac{1}{n_j} \sum Y_k^2 - \left(\frac{1}{n_j} \sum Y_k \right)^2\right)}{\left(\frac{1}{n_j} \sum Y_k\right)^2} \\ &= \frac{\sum_{s_k\in B_i} Y_k^2 - \frac{1}{n_i} \left(\sum_{s_k\in B_i} Y_k\right)^2}{\left(\sum_{s_k\in B_i} Y_k\right)^2} + \frac{\sum_{s_k\in B_j} Y_k^2 - \frac{1}{n_j} \left(\sum_{s_k\in B_j} Y_k\right)^2}{\left(\sum_{s_k\in B_j} Y_k\right)^2}. \end{aligned} $$
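The same pattern in code, reusing only per-bin sums (a sketch; the function name is mine):

```python
import numpy as np

def gamma_log_ratio_var(y_i, y_j):
    """Delta-method variance of log(mu_hat_i / mu_hat_j) for Gamma
    responses, with the variance estimated by the method of moments.

    y_i, y_j : arrays of positive responses in bins B_i and B_j.
    """
    def bin_term(y):
        y = np.asarray(y, float)
        n, s1, s2 = len(y), y.sum(), (y ** 2).sum()
        # (sum Y^2 - (sum Y)^2 / n) / (sum Y)^2
        return (s2 - s1 ** 2 / n) / s1 ** 2
    return bin_term(y_i) + bin_term(y_j)
```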

Results

Let's showcase the results using the breast cancer dataset again. Instead of the binary response variable, we examine the univariate power of mean radius in predicting worst area, which we assume follows a normal distribution. Below are the results using the median (50th percentile) as the threshold for the Bernoulli bifurcation. Note that in the transformed KS plot, we plot the WOE value against the cumulative distributions instead of the raw values.

Figure 1 - Bernoulli bifurcation, 50th percentile

However, if we use a higher percentile, say the 80th, to bifurcate the variable $Y$, we see the univariate strength increase:

Figure 2 - Bernoulli bifurcation, 80th percentile

This tells us that mean radius is stronger at predicting larger values of $Y$ than smaller values. The overall Gini value turns out to be 0.5.