Nov. 18, 2018 $\newcommand{\bs}{\boldsymbol}$ $\newcommand{\argmin}[1]{\underset{\bs{#1}}{\text{arg min}}\,}$ $\newcommand{\argmax}[1]{\underset{\bs{#1}}{\text{arg max}}\,}$ $\newcommand{\tr}{^{\top}}$ $\newcommand{\norm}[1]{\left|\left|\,#1\,\right|\right|}$ $\newcommand{\given}{\,|\,}$ $\DeclareMathOperator{\E}{\mathbb{E}}$ $\newcommand{\blue}[1]{{\color{blue} {#1}}}$ $\newcommand{\red}[1]{{\color{red} {#1}}}$ $\newcommand{\orange}[1]{{\color{orange} {#1}}}$ $\newcommand{\pfrac}[2]{\frac{\partial #1}{\partial #2}}$

This is the first part of an upcoming series on the fascinating topic of `generalized linear models` (GLMs). Although the truth is rarely linear, a solid grasp of linear models is essential to fully appreciate the beauty and craft of statistical modeling.

We consider the `exponential dispersion family` of distributions in the general form:

$$ p(y_i \given \theta_i) = \text{exp} \left(\frac{y_i\theta_i-b(\theta_i)}{a_i(\phi)} + c(y_i, \phi) \right), \tag{1} $$

where $\theta_i, \phi$ are the location and scale parameters, and $c(y_i, \phi)$ and $a_i(\phi)$ are known functions. Usually the term $a_i(\phi)$ has the form $a_i(\phi) = \phi/\omega_i$, for a known weight $\omega_i$.
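For instance, the Poisson distribution with mean $\lambda_i$ fits this form:

$$ p(y_i \given \theta_i) = \frac{\lambda_i^{y_i} e^{-\lambda_i}}{y_i!} = \text{exp}\left( y_i \log \lambda_i - \lambda_i - \log y_i! \right), $$

so that $\theta_i = \log \lambda_i$, $b(\theta_i) = \text{exp}(\theta_i)$, $a_i(\phi) = 1$, and $c(y_i, \phi) = -\log y_i!$.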

The expected value of $Y_i$ can be computed using the following trick. First we note that

$$ \int p(y_i \given \theta_i)\, dy_i =1. $$

Without getting into too many technical details, assume that we can interchange the order of differentiation and integration. Hence,

$$ \frac{\partial}{\partial \theta_i} \int p(y_i \given \theta_i)\, dy_i = \int\frac{\partial}{\partial \theta_i} p(y_i \given \theta_i)\, dy_i =0. $$

The partial derivative of the density can be computed as

$$ \frac{\partial}{\partial \theta_i} p(y_i \given \theta_i) = (y_i -b'(\theta_i)) p(y_i \given \theta_i). $$

This implies that

$$ \int y_i\, p(y_i \given \theta_i)\, dy_i - \int b'(\theta_i) p(y_i \given \theta_i)\, dy_i = 0. $$

From this we can see that

$$ \E (Y_i) = \mu_i = b'(\theta_i). \tag{2} $$

Using a similar trick, we can also show that

$$ \text{Var}(Y_i) = \sigma_i^2 = b''(\theta_i)\,a_i(\phi). \tag{3} $$
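As a quick sanity check of (2) and (3), the sketch below uses `sympy` to differentiate the Poisson cumulant function $b(\theta) = e^{\theta}$ (for which $a_i(\phi) = 1$), recovering the familiar mean and variance, both equal to $\lambda = e^{\theta}$:

```python
import sympy as sp

theta = sp.symbols('theta')
b = sp.exp(theta)            # cumulant function b(theta) for the Poisson case

mean = sp.diff(b, theta)     # E(Y) = b'(theta), by (2)
var = sp.diff(b, theta, 2)   # Var(Y) = b''(theta) * a(phi) with a(phi) = 1, by (3)

print(mean)  # exp(theta), i.e. lambda
print(var)   # exp(theta), i.e. lambda
```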

The most notable assumption of the `generalized linear model` (GLM) is that there is a bijective, continuous, and differentiable function $g(\mu_i)$, called the `link function`, such that

$$ g(\mu_i) = \eta_i = \bs{x}_i\tr \bs{\beta} = \sum_{j=1}^p \beta_j x_{ij}, \tag{4} $$

where $\bs{x}_i \in \mathbb{R}^p$ is a vector of covariates for the $i$th observation and $\bs{\beta}$ is the vector of their corresponding coefficients. The variable $\eta_i$ is also called the `linear predictor`.

If the linear predictor $\eta_i$ is the same as the `natural parameter` $\theta_i$, then we call the corresponding link $g(\mu_i)$ the `canonical link function`. Since $\mu_i = b'(\theta_i)$ by (2), we have that $\theta_i = (b')^{-1}(\mu_i)$. Thus, for ordinary linear regression (where $b(\theta_i) = \theta_i^2/2$, so $b'$ is the identity), the canonical link is the identity function. For the binomial distribution, the canonical link is the logit function, and for the Poisson distribution, it is the log function.
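To illustrate the recipe $\theta_i = (b')^{-1}(\mu_i)$, the sketch below uses `sympy` to invert $b'(\theta) = e^{\theta}/(1+e^{\theta})$ for the binomial case, recovering the logit link:

```python
import sympy as sp

theta, mu = sp.symbols('theta mu', positive=True)

b = sp.log(1 + sp.exp(theta))   # binomial cumulant function
b_prime = sp.diff(b, theta)     # b'(theta) = exp(theta) / (1 + exp(theta))

# Solve mu = b'(theta) for theta: this is the canonical link g(mu).
link = sp.solve(sp.Eq(b_prime, mu), theta)[0]
print(link)                     # equivalent to the logit, log(mu / (1 - mu))
```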

For example, suppose that $n_iY_i \sim \text{Binomial}(n_i, p_i)$. (This means that $y_i$ represents the `sample proportion` of successes in $n_i$ Bernoulli trials.) Letting $\theta_i = \text{logit}(p_i)$, we have $p_i = P(\theta_i)$, where $P$ is the logistic function, so that $\log(1-p_i) = -\log(1+\text{exp}(\theta_i))$. Hence, the data model can be written as

$$ \begin{aligned} p(y_i \given p_i, n_i) &= \binom{n_i}{n_iy_i} p_i^{n_iy_i}(1-p_i)^{n_i-n_iy_i} \\ &= \text{exp}\left[\frac{y_i\theta_i - \log(1+\text{exp}(\theta_i))}{1/n_i} + \log\binom{n_i}{n_iy_i} \right]. \end{aligned} $$

Thus the distribution of the sample proportion $Y_i$ is in the exponential dispersion family, with:

$$ \begin{aligned} b(\theta_i) &= \log(1+\text{exp}(\theta_i)) \\ a_i(\phi) &= 1/n_i \\ c(y_i, \phi) &= \log \binom{n_i}{n_iy_i}. \end{aligned} $$

Since $\theta_i = \text{logit}(p_i)$, the canonical link function is the logit function, and (2) and (3) give

$$ \begin{aligned} \E(Y_i) &= b'(\theta_i) = \frac{\text{exp}(\theta_i)}{1+\text{exp}(\theta_i)} = p_i \\ \text{Var}(Y_i) &= b''(\theta_i)\, a_i(\phi) = \frac{\text{exp}(\theta_i)}{\left[1+\text{exp}(\theta_i)\right]^2 n_i} = \frac{p_i(1-p_i)}{n_i}. \end{aligned} $$
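These two moments are easy to verify by simulation. The sketch below (with arbitrary illustrative values $n_i = 50$, $p_i = 0.3$) draws sample proportions and compares their empirical mean and variance with $p_i$ and $p_i(1-p_i)/n_i$:

```python
import numpy as np

rng = np.random.default_rng(0)
n_i, p_i = 50, 0.3                     # illustrative values
reps = 200_000

# Sample proportions Y_i = (number of successes) / n_i.
y = rng.binomial(n_i, p_i, size=reps) / n_i

print(y.mean())                        # close to p_i = 0.3
print(y.var())                         # close to p_i * (1 - p_i) / n_i = 0.0042
```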

Denote by $l(\bs{\beta})$ the log-likelihood function for the regression coefficients. Given independent observations $y_1,..., y_n$, where each $y_i \sim p(y_i \given \theta_i)$, it follows from (1) that the log-likelihood is:

$$ l(\bs{\beta}) =\sum_i l_i(\bs{\beta}) = \sum_i \frac{y_i\theta_i - b(\theta_i)}{a_i(\phi)} + \sum_i c(y_i, \phi). \tag{5} $$

Recall $\mu_i = b'(\theta_i)$ and $\text{Var}(Y_i) = b''(\theta_i)\,a_i(\phi)$ from (2) and (3). Hence, by the chain rule, we have

$$ \begin{aligned} \pfrac{l_i}{\beta_j} &= \pfrac{l_i}{\theta_i}\pfrac{\theta_i}{\mu_i}\pfrac{\mu_i}{\eta_i}\pfrac{\eta_i}{\beta_j} \\ &= \frac{y_i- \mu_i}{a_i(\phi)} \frac{a_i(\phi)}{\text{Var}(Y_i)}{\pfrac{\mu_i}{\eta_i} x_{ij}} \\ &= \frac{(y_i-\mu_i) x_{ij}}{\text{Var}(Y_i)} \pfrac{\mu_i}{\eta_i}. \end{aligned} $$

The `likelihood equations` are

$$ \sum_{i=1}^n \frac{(y_i-\mu_i) x_{ij}}{\text{Var}(Y_i)} \pfrac{\mu_i}{\eta_i} = 0, \quad j=1,..., p. \tag{6} $$

Note that although $\bs{\beta}$ does not appear in (6), it is implicitly represented through $\mu_i$, since

$$ \mu_i = g^{-1}(\eta_i) = g^{-1} \left(\sum_j \beta_j x_{ij}\right), $$

so that different link functions $g$ yield different likelihood equations. Next we talk about methods of finding the MLE.

One way to find the MLE is the `Newton-Raphson method`, an iterative method for unconstrained optimization. The basic idea is to start with an initial guess for the solution, obtain the next guess by approximating the objective function by a second-degree polynomial around the current guess and finding that polynomial's optimum, and then repeat the process at the new guess.

Let's briefly discuss the formal details of the method. The objective is to find the value $\widehat{\bs{\beta}}$ that maximizes the function $l(\bs{\beta})$ described in (5). Let $\bs{u}$ be the gradient of the log-likelihood, i.e.

$$ \bs{u} = \nabla l(\bs{\beta}) = \begin{pmatrix} \pfrac{l(\bs{\beta})}{\beta_1} \\ \vdots \\ \pfrac{l(\bs{\beta})}{\beta_p} \end{pmatrix}, $$

where each component is given by (6).

Also, let $\bs{H}$ denote the `Hessian matrix` of the log-likelihood, i.e.

$$ \bs{H} = \begin{pmatrix} \frac{\partial^2 l(\bs{\beta})}{\partial \beta_1 \partial \beta_1} & \cdots & \frac{\partial^2 l(\bs{\beta})}{\partial \beta_1 \partial \beta_p} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 l(\bs{\beta})}{\partial \beta_p \partial \beta_1} & \cdots & \frac{\partial^2 l(\bs{\beta})}{\partial \beta_p \partial \beta_p} \end{pmatrix}. $$

At iteration $t$ of the algorithm, let $\bs{u}^{(t)}$ and $\bs{H}^{(t)}$ be the gradient and Hessian evaluated at $\bs{\beta}^{(t)}$. For iterations $t=0,1,2,...,N$, we approximate the log-likelihood near $\bs{\beta}^{(t)}$ by the second-order `Taylor series expansion`:

$$ \tilde{l}(\bs{\beta}) = l(\bs{\beta}^{(t)}) + {\bs{u}^{(t)}}\tr (\bs{\beta} - \bs{\beta}^{(t)}) + \frac{1}{2}(\bs{\beta} - \bs{\beta}^{(t)})\tr \bs{H}^{(t)} (\bs{\beta} - \bs{\beta}^{(t)}). $$

Setting the gradient of the approximation $\tilde{l}(\bs{\beta})$ equal to $\bs{0}$, we have

$$ \nabla \tilde{l}(\bs{\beta}) = \bs{u}^{(t)} + \bs{H}^{(t)}(\bs{\beta} - \bs{\beta}^{(t)}) = \bs{0}. $$

Solving for $\bs{\beta}$ yields the following update

$$ \bs{\beta}^{(t+1)} = \bs{\beta}^{(t)} - (\bs{H}^{(t)})^{-1} \bs{u}^{(t)}. \tag{7} $$
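In one dimension, (7) reduces to $\beta^{(t+1)} = \beta^{(t)} - u^{(t)}/H^{(t)}$. As a minimal sketch (a toy one-parameter Poisson log-likelihood, not an example from the text), consider maximizing $l(\theta) = y\theta - e^{\theta}$, whose maximizer is $\theta = \log y$:

```python
import math

def newton_raphson(grad, hess, x0, tol=1e-10, max_iter=100):
    """One-dimensional Newton-Raphson: x <- x - grad(x) / hess(x), as in (7)."""
    x = x0
    for _ in range(max_iter):
        step = grad(x) / hess(x)
        x -= step
        if abs(step) < tol:
            break
    return x

# Toy Poisson log-likelihood in the natural parameter:
# l(theta) = y*theta - exp(theta), so u = y - exp(theta), H = -exp(theta).
y = 5.0
theta_hat = newton_raphson(lambda t: y - math.exp(t),
                           lambda t: -math.exp(t), x0=0.0)
print(theta_hat)  # log(5) ≈ 1.6094
```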

Instead of using the Hessian matrix in the update (7), we can use the negative of its expected value, known as the `Fisher information matrix` $\mathcal{I}$, whose $ij$-th entry has the form

$$ \mathcal{I}_{ij} = \E\left(-\frac{\partial^2 l(\bs{\beta})}{\partial \beta_i \partial \beta_j}\right). $$

For the exponential family, we have the useful result

$$ \E\left(-\frac{\partial^2 l_i}{\partial \beta_h \partial \beta_j}\right) = \E \left[\left(\pfrac{l_i}{\beta_h}\right)\left(\pfrac{l_i}{\beta_j}\right)\right]. $$

Thus, using (6), we have

$$ \begin{aligned} \E\left(-\frac{\partial^2 l_i}{\partial \beta_h \partial \beta_j}\right) &= \E\left[\frac{(Y_i - \mu_i)x_{ih}}{\text{Var}(Y_i)} \pfrac{\mu_i}{\eta_i} \frac{(Y_i-\mu_i)x_{ij}}{\text{Var}(Y_i)}\pfrac{\mu_i}{\eta_i} \right] \\ &= \frac{x_{ih}x_{ij}}{\text{Var}(Y_i)} \left(\pfrac{\mu_i}{\eta_i}\right)^2, \end{aligned} $$

where the last step uses $\E\left[(Y_i - \mu_i)^2\right] = \text{Var}(Y_i)$.

Since $l(\bs{\beta}) = \sum_i l_i$,

$$ \E\left(-\frac{\partial^2 l(\bs{\beta})}{\partial \beta_h \partial \beta_j}\right) = \sum_{i=1}^n \frac{x_{ih}x_{ij}}{\text{Var}(Y_i)} \left(\pfrac{\mu_i}{\eta_i}\right)^2. $$

In matrix form, we have that for the exponential family, the Fisher information matrix is

$$ \mathcal{I} = \bs{X}\tr \bs{W} \bs{X}, \quad \bs{W} = \text{diag}\left(\frac{(\partial \mu_i / \partial \eta_i)^2}{\text{Var}(Y_i)} \right). \tag{8} $$

Hence, equation (7) can be modified as

$$ \bs{\beta}^{(t+1)} = \bs{\beta}^{(t)} + (\mathcal{I}^{(t)})^{-1} \bs{u}^{(t)} $$

or

$$ \mathcal{I}^{(t)} \bs{\beta}^{(t+1)} = \mathcal{I}^{(t)} \bs{\beta}^{(t)} + \bs{u}^{(t)}, \tag{9} $$

where $\mathcal{I}^{(t)}$ is the Fisher information matrix evaluated at $\bs{\beta}^{(t)}$.

Recall from (6) that each element of $\bs{u}$ is given by

$$ u_j = \sum_{i=1}^n \frac{(y_i-\mu_i) x_{ij}}{\text{Var}(Y_i)} \pfrac{\mu_i}{\eta_i}, \quad j=1,..., p. \tag{10} $$

Combining (10) with (8), we have that

$$ u_j = \sum_{i=1}^n (y_i- \mu_i)x_{ij} W_{ii} \pfrac{\eta_i}{\mu_i}. $$

Vectorizing the expression above, we have

$$ \bs{u} = \bs{X}\tr \bs{W} \bs{v}, \tag{11} $$

where $\bs{v} \in \mathbb{R}^n$ is a vector whose $i$-th entry is given by

$$ v_i = (y_i-\mu_i) \pfrac{\eta_i}{\mu_i}. $$

Combining (8) and (11), we have that

$$ \begin{aligned} \mathcal{I}^{(t)} \bs{\beta}^{(t)} + \bs{u}^{(t)} &= \bs{X}\tr \bs{W}^{(t)} \bs{X}\bs{\beta}^{(t)} + \bs{X}\tr \bs{W}^{(t)} \bs{v}^{(t)}\\ &=\bs{X}\tr \bs{W}^{(t)} (\bs{X} \bs{\beta}^{(t)} + \bs{v}^{(t)}) \\ &=\bs{X}\tr \bs{W}^{(t)} \bs{z}^{(t)}, \end{aligned} \tag{12} $$

where $\bs{z}^{(t)}$ has elements

$$ \begin{aligned} z^{(t)}_i &= \sum_{j=1}^p x_{ij} \beta_j^{(t)} + (y_i-\mu_i^{(t)}) \pfrac{\eta_i^{(t)}}{\mu_i^{(t)}} \\ &= \eta_i^{(t)} + (y_i - \mu_i^{(t)}) \pfrac{\eta_i^{(t)}}{\mu_i^{(t)}}. \end{aligned} \tag{13} $$

Comparing (9) and (12), we see that the Fisher scoring update has the form

$$ (\bs{X}\tr \bs{W}^{(t)} \bs{X})\,\bs{\beta}^{(t+1)} = \bs{X}\tr \bs{W}^{(t)} \bs{z}^{(t)}. \tag{14} $$

Treating $\bs{\beta}^{(t+1)}$ as the unknown, we see that (14) is precisely the `normal equations` for the weighted least squares fit of the response $\bs{z}^{(t)}$, whose solution has the very familiar-looking form

$$ \bs{\beta}^{(t+1)} = (\bs{X}\tr \bs{W}^{(t)} \bs{X})^{-1} \bs{X}\tr \bs{W}^{(t)} \bs{z}^{(t)}. \tag{15} $$
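Numerically, (15) is just a weighted least squares solve. The sketch below (with arbitrary random data) forms the weighted normal equations directly and checks against the equivalent ordinary least squares fit on $\sqrt{w}$-rescaled data:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))        # design matrix
z = rng.normal(size=100)             # adjusted response
w = rng.uniform(0.5, 2.0, size=100)  # positive weights (diagonal of W)

# Solve the weighted normal equations (X' W X) beta = X' W z.
WX = X * w[:, None]                  # scale each row of X by its weight
beta = np.linalg.solve(X.T @ WX, X.T @ (w * z))

# Equivalent formulation: OLS on sqrt(w)-rescaled data.
sw = np.sqrt(w)
beta_ols, *_ = np.linalg.lstsq(X * sw[:, None], sw * z, rcond=None)

print(np.allclose(beta, beta_ols))   # True
```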

Fun Observation: The vector $\bs{z}$ in (13) is the first-order linearization of the link function $g$ around $\bs{\mu}$, evaluated at $\bs{y}$:

$$g(y_i) \approx g(\mu_i) + (y_i-\mu_i) g'(\mu_i) = \eta_i + (y_i - \mu_i) \pfrac{\eta_i}{\mu_i} = z_i. $$
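This linearization is easy to see numerically. The sketch below uses the logit link with illustrative values of $\mu$ and a nearby $y$:

```python
import math

# Logit link g(mu) = log(mu / (1 - mu)), with derivative g'(mu) = 1 / (mu (1 - mu)).
mu, y = 0.40, 0.43                   # illustrative values, y near mu

g = lambda m: math.log(m / (1 - m))
g_prime = lambda m: 1.0 / (m * (1 - m))

z = g(mu) + (y - mu) * g_prime(mu)   # first-order expansion around mu
print(g(y), z)                       # the two values are close
```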

We refer to $\bs{z}$ as the `adjusted response variable`. The iterative cycle regresses $\bs{z}^{(t)}$ on the data $\bs{X}$ using the weights $\bs{W}^{(t)}$, which are inversely proportional to the variance of the adjusted response variable, to obtain a new estimate $\bs{\beta}^{(t+1)}$. This estimate yields a new linear predictor value $\bs{\eta}^{(t+1)} = \bs{X} \bs{\beta}^{(t+1)}$, and a new adjusted response value $\bs{z}^{(t+1)}$ is calculated for the next iteration. This process is known as `iterative weighted least squares` (IWLS). We summarize the algorithm below.

Algorithm 1 (Iterative Weighted Least Squares)

- Initialize the parameter $\bs{\beta}$. (A simple way to begin is to use the data $\bs{y}$ as the initial estimate of $\bs{\mu}$.)
- Calculate the estimated linear predictor $\bs{\eta} = \bs{X} \bs{\beta}$ and the fitted means $\bs{\mu} = g^{-1}(\bs{\eta})$.
- Using these quantities, calculate the adjusted response variable $\bs{z}$, given by $$z_i = \eta_i + (y_i - \mu_i) \pfrac{\eta_i}{\mu_i}. \tag{16}$$
- Calculate the iterative weights $$w_i = \frac{(\partial \mu_i / \partial \eta_i)^2}{\text{Var}(Y_i)} = \left[b''(\theta_i) \left(\pfrac{\eta_i}{\mu_i}\right)^2 a_i(\phi)\right]^{-1}. \tag{17}$$
- Obtain an improved estimate of $\bs{\beta}$ by regressing the adjusted response variable $\bs{z}$ on the predictors $\bs{X}$ using the weights $\bs{W}$, i.e. update $\bs{\beta}$ via the weighted least squares estimator $$\bs{\beta} \leftarrow (\bs{X}\tr \bs{W} \bs{X})^{-1}\bs{X}\tr \bs{W}\bs{z}. \tag{18}$$
- Repeat until convergence.
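The algorithm above can be sketched in a few lines for the binomial GLM with the canonical logit link. This is a minimal illustrative implementation on simulated grouped data, not production code; the function name and simulated values are my own choices:

```python
import numpy as np

def iwls_logistic(X, y, n_trials, max_iter=25, tol=1e-8):
    """Fit a binomial GLM with the canonical logit link by IWLS.

    y holds sample proportions; n_trials is the number of Bernoulli
    trials per observation, so Var(Y_i) = p_i (1 - p_i) / n_i.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        eta = X @ beta                    # linear predictor (4)
        mu = 1.0 / (1.0 + np.exp(-eta))   # fitted means, g^{-1}(eta)
        dmu_deta = mu * (1.0 - mu)        # d(mu)/d(eta) for the logit link
        var = mu * (1.0 - mu) / n_trials  # Var(Y_i) for sample proportions
        w = dmu_deta ** 2 / var           # iterative weights (17)
        z = eta + (y - mu) / dmu_deta     # adjusted response (16)
        # Weighted least squares update (18).
        beta_new = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta

# Simulated grouped binomial data (illustrative values).
rng = np.random.default_rng(42)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
beta_true = np.array([-0.5, 1.0])
p = 1.0 / (1.0 + np.exp(-(X @ beta_true)))
n_trials = np.full(500, 20)
y = rng.binomial(n_trials, p) / n_trials  # sample proportions

beta_hat = iwls_logistic(X, y, n_trials)
print(beta_hat)  # close to beta_true = [-0.5, 1.0]
```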

This is the frequentist approach to fitting generalized linear models. In the next post, we will talk about a Bayesian approach based on the IWLS procedure.