**Elan Ding**

*Modified*: July 2, 2018
$\newcommand{\bs}{\boldsymbol}$
$\newcommand{\argmin}[1]{\underset{\bs{#1}}{\text{arg min}}\,}$
$\newcommand{\argmax}[1]{\underset{\bs{#1}}{\text{arg max}}\,}$
$\newcommand{\tr}{^{\top}}$
$\newcommand{\norm}[1]{\left|\left|\,#1\,\right|\right|}$
$\newcommand{\given}{\,|\,}$

In this post I will talk about the **Naive Bayes (NB) classifier**, which is a well-known machine learning model, and often the "go-to" approach, in classifying spam emails. The Naive Bayes is a type of generative learning model that assumes the distribution of $p(\bs{x}\given y)$, where $\bs{x}$ is our feature vector, and $y$ is the labeled class. It then uses *Bayes' Theorem* to try to estimate $p(y\given \bs{x})$ by the following

$$ p(y\given \bs{x}) = \frac{p(\bs{x}\given y) p(y)}{p(\bs{x})} $$

Before we talk about the NB classifier, we need to briefly discuss a broader model, called the *Gaussian Discriminant Analysis*.

The **Gaussian Discriminant Analysis** (GDA), also referred to as the **Linear Discriminant Analysis** (LDA), makes the assumption that $p(\bs{x}\given y)$ follows a multivariate Gaussian distribution. That is

$$ p(\bs{x}\given y) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \text{exp}\left(-\frac{1}{2}(\bs{x}-\bs{\mu}_y)\tr \Sigma^{-1}(\bs{x}-\bs{\mu}_y)\right) \tag{1} $$

where $\bs{\mu}_y = E(\bs{x}\given y)$ and $\Sigma = E\left((\bs{x}-\bs{\mu}_y) (\bs{x}-\bs{\mu}_y)\tr\right)$, the covariance matrix, is assumed to be the same for all classes $y$.

Futhermore, let's assume that our class variable $y$ follows a *multiclass Bernoulli* distribution. That is $y$ takes on a value among $\{1,2,...,k\}$ with probabilities $\pi_1, \pi_2,..., \pi_k$. These probabilities are called the *prior probabilities*. The probability mass function for $y$ can be written as

$$ p(y) = \prod_{i=1}^k \pi_i^{I(y=i)}, \quad \text{ such that }\sum_{i=1}^k \pi_i =1. \tag{2} $$

where $I(\cdot)$ is the indicator function.

Here our parameters are $\theta = (\pi_1,..., \pi_{k-1}, \bs{\mu}_1,..., \bs{\mu}_k, \Sigma)$. (Note that I left out the parameter $\pi_k$ since it can be determined after we know the other $k-1$ prior probabilities.)

Now for a training set $\{(\bs{x}^{(1)}, y^{(1)}), (\bs{x}^{(2)}, y^{(2)}),..., (\bs{x}^{(n)}, y^{(n)})\}$, We can construct the *log-likelihood function* for $\theta$ as

$$ \begin{aligned} l(\theta) &= \log p_{\theta}(\bs{x}, y) \\ &= \log \prod_{i=1}^n p_{\theta} (\bs{x}^{(i)} \given y^{(i)})p_{\theta}(y^{(i)}) \end{aligned}\tag{3} $$

where $p_{\theta}(\bs{x}\given y)$ and $p_{\theta}(y)$ are given by $(1)$ and $(2)$, respectively.

The parameters $\theta$ can be estimated using the *maximum likelihood estimators (MLE)*, i.e. we can use calculus to maximize maximize (3) with respect to $\theta$. The solutions are

$$ \begin{aligned} \widehat{\pi}_k &= \frac{1}{n}\sum_{i=1}^{n} I(y^{(i)}=k) \\ \widehat{\bs{\mu}}_k &= \frac{\sum_{i=1}^n I(y^{(i)}=k)\bs{x}^{(i)}}{\sum_{i=1}^n I(y^{(i)}=k)} \\ \widehat{\Sigma} &= \frac{1}{n} \sum_{i=1}^n (\bs{x}^{(i)}-\bs{\mu}_{y^{(i)}})(\bs{x}^{(i)}-\bs{\mu}_{y^{(i)}})\tr \end{aligned} $$

The derivation of the MLE is rather tedious (refer to this for the derivation), but the result is enlightening. The MLE for the prior probability $\pi_k$ is simply the relative frequency of class $k$ in the training set. The MLE for the mean vector $\bs{\mu}_k$ in class $k$ is the average of the data vectors $\bs{x}^{(i)}$ in class $k$. The MLE for the covariance matrix is again an expected result.

After obtaining an estimate for all parameters, we can draw $k-1$ `decision boundaries`

between the $k$ classes. For example, between $k=1$ and $k=2$, we equate prababilities $p(y=1 \given \bs{x})$ with $p(y=2 \given \bs{x})$, and take logs on both sides to obtain the equation:

$$ \bs{x}\tr \Sigma \bs{\mu}_1 - \frac{1}{2} {\bs{\mu}_1}\tr \Sigma^{-1}\bs{\mu}_1 + \log \pi_1 = \bs{x}\tr \Sigma \bs{\mu}_2 - \frac{1}{2} {\bs{\mu}_2}\tr \Sigma^{-1}\bs{\mu}_2 + \log \pi_2 $$

Indeed, this is a linear function!

We will use a classic dataset from UCI database. This dataset contains the frequencies of 48 words in 4601 emails containing both spam and non-spam emails

In [1]:

```
import pandas as pd
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
```

In [2]:

```
df = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/spambase/spambase.data",
header=None)
```

In [3]:

```
df[[0,1,2,3,4, 57]].head()
```

Out[3]:

I only showed the first five columns of the data table. They are the proportions of the words "make", "address", "all", "3d", "our" appearing in the email. The last column is a binary variable (1 if spam, 0 if not spam).

In [4]:

```
clf = LinearDiscriminantAnalysis()
```

In [5]:

```
from sklearn.model_selection import cross_val_score
```

In [6]:

```
scores = cross_val_score(
clf, df.drop(57, axis=1), df[57], cv=5, scoring='f1_macro')
```

In [7]:

```
print("The 95% C.I. of score: {:.3} (+/-) {:.3}".format(
scores.mean(), scores.std()*2))
```

Using 5-Fold cross-validation, we predict that the the model will accurately predict about 87% of the test set. The standard error is quite small in this case, showing the consistency of the classifier.

Consider a spam classifier containing 50,000 words instead of merely 48 in our previous example. Then using GDA would mean to invert a 50,000 by 50,000 matrix. This is computationally expensive. Gaussian Naive Bayes classifier makes an assumption that the covariance matrix $\Sigma$ is diagonal. Let $\bs{x}=(x_1,x_2,..., x_p)$. Under the Naive Bayes assumption, $x_i$ and $x_j$ are conditionally independent given $y$. In this case we can represent the joint distribution of the vector as

$$ p(\bs{x}\given y) = \prod_{i=1}^p p(x_i \given y) $$

where each $p(x_i \given y)$ follows a univariate Gaussian distribution with parameters $\mu_i$ and $\Sigma_{i,i}$. Let's try it on the same dataset.

In [8]:

```
from sklearn.naive_bayes import GaussianNB
```

In [9]:

```
gnb = GaussianNB()
```

In [10]:

```
scores = cross_val_score(
gnb, df.drop(57, axis=1), df[57], cv=5, scoring='f1_macro')
```

In [11]:

```
print("The 95% C.I. of score: {:.3} (+/-) {:.3}".format(
scores.mean(), scores.std()*2))
```

Using 5-Fold cross-validation, we predict that the the model will accurately predict about 82% of the test set, with a rather large variation. This is to be expected since we made a much stronger assumption about our dataset. However, the drop in performance is acceptable.

In addition to the Naive Bayes assumption, we further suppose that $x_i \given y$ follows a Bernoulli distribution with parameter $\phi_{i,y}$ where

$$ \begin{aligned} \phi_{i,y} &= p(x_i=1\given y)\\ \pi_j &= p(y=j) \end{aligned} $$

Now our parameters $\theta$ is a set of $pk+k-1$ parameters $\{\pi_k, \phi_{i,y}\}$, for $i=1,...,p$, $y=1,...,k$, and $j=1,...,k-1$.

Given a training set $\{(\bs{x}^{(1)}, y^{(1)}), (\bs{x}^{(2)}, y^{(2)}),..., (\bs{x}^{(n)}, y^{(n)})\}$, we can again construct the likelihood function to be

$$ \begin{aligned} L(\theta)& =\prod_{i=1}^n p_{\theta}(\bs{x}^{(i)}, y^{(i)}) \\ &= \prod_{i=1}^n \prod_{j=1}^p p_{\theta} (x^{(i)}_j \given y^{(i)}) p_{\theta} (y^{(i)}) \end{aligned} $$

Again the MLE of $\theta$ can be found by maximizing $L(\theta)$. The solution is

$$ \begin{aligned} \widehat{\phi}_{j,l} &= \frac{\sum_{i=1}^n I(x^{(i)}_j=1, y^{(i)} = l) \color{red}+\color{red} 1} {\sum_{i=1}^n I(y^{(i)}=1) \color{red}+\color{red}2 } \\ \widehat{\pi}_l &= \frac{\sum_{i=1}^n I(y^{(i)} = l)}{n}. \end{aligned} $$

where the $+1$ and $+2$ in the fraction is due to Laplace smoothing.

Using the MLE, the algorithm compares the posterior probabilities $p(y \given \bs{x})$ to make a decision.

The `sklearn`

is truly a remarkable library. Referring to the documentation, we see that the `BernoulliNB`

has an option called `binarize`

. At default, this hyperparameter is set to 0. This means that the classifier simply looks for the presence of each word in the email. ($x_i=1$ if the word $i$ is present and 0 otherwise).

In [12]:

```
from sklearn.naive_bayes import BernoulliNB
```

In [13]:

```
bnb = BernoulliNB(binarize=0)
```

In [14]:

```
scores = cross_val_score(
bnb, df.drop(57, axis=1), df[57], cv=5, scoring='f1_macro')
```

In [15]:

```
print("The 95% C.I. of score: {:.3} (+/-) {:.3}".format(
scores.mean(), scores.std()*2))
```

Amazing. The Bernoulli classifier does just as well as the original GDA, which is more computationally expensive. The following quote from the documentation is instructive:

"In the case of text classification, word occurrence vectors (rather than word count vectors) may be used to train and use this classifier. BernoulliNB might perform better on some datasets, especially those with shorter documents. It is advisable to evaluate both models, if time permits. "

Finally, suppose we tune our hyperparameter `binarize`

to be 0.1, the model actually improves.

In [16]:

```
bnb2 = BernoulliNB(binarize=0.1)
```

In [17]:

```
scores = cross_val_score(
bnb2, df.drop(57, axis=1), df[57], cv=5, scoring='f1_macro')
```

In [18]:

```
print("The 95% C.I. of score: {:.3} (+/-) {:.3}".format(
scores.mean(), scores.std()*2))
```