Generalized Linear Model IV

Dec. 6, 2018 $\newcommand{\bs}{\boldsymbol}$ $\newcommand{\argmin}[1]{\underset{\bs{#1}}{\text{arg min}}\,}$ $\newcommand{\argmax}[1]{\underset{\bs{#1}}{\text{arg max}}\,}$ $\newcommand{\tr}{^{\top}}$ $\newcommand{\norm}[1]{\left|\left|\,#1\,\right|\right|}$ $\newcommand{\given}{\,|\,}$ $\DeclareMathOperator{\E}{\mathbb{E}}$ $\newcommand{\blue}[1]{{\color{blue} {#1}}}$ $\newcommand{\red}[1]{{\color{red} {#1}}}$ $\newcommand{\orange}[1]{{\color{orange} {#1}}}$ $\newcommand{\pfrac}[2]{\frac{\partial #1}{\partial #2}}$

We start by introducing the simple Poisson log-linear model, where the Poisson mean $\lambda_{i}$ is related to the covariates through the log link:

$$ \begin{aligned} g(\lambda_{i}) &= \eta_{i} = \log \lambda_{i} = \bs{x}_{i}\tr \bs{\theta} \\ i &= 1,..., n \end{aligned}\tag{1} $$

Before doing any analysis, it helps to visualize the situation. We generate a dataset whose truth is given by (1).

In [1]:
data.load <- function(p, theta, seed=2, n=100) {
    # -----------
    # theta = fixed effect
    # n     = number of points
    # p     = number of columns of design matrix
    # -----------
    set.seed(seed)
    X <- rep(1, n)                     # intercept column
    for (i in 1:(p-1)) {
        X <- cbind(X, runif(n, 0, 5))  # add a uniform covariate
    }
    Y <- rpois(n, exp(as.matrix(X) %*% theta))  # Poisson response with log link
    cbind(X, Y)
}
In [5]:
mydata <- data.load(p=2, theta=c(0.1,0.6)) # generate dataset
options(repr.plot.width=4, repr.plot.height=4) # plotting
plot(mydata[,2], mydata[,3], xlab='X', ylab='Y', main='Poisson')

This is a classic example of data that violate the assumptions of linear regression: the variance of the residuals grows with $X$. This also makes a lot of sense: as $X$ increases, the Poisson mean increases, and since the Poisson variance $\lambda$ equals the mean, the spread of $Y$ increases as well, as the quick residual check below shows.
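As a quick sanity check (reusing `mydata` from the cell above), we can fit an ordinary least-squares line and plot its residuals against $X$; the fan shape is exactly the violation just described. The variable names here are illustrative.

In [ ]:
# Fit an ordinary linear regression and inspect the residuals:
# their spread grows with X, violating the constant-variance assumption.
fit.lm <- lm(mydata[,3] ~ mydata[,2])
plot(mydata[,2], resid(fit.lm), xlab='X', ylab='OLS residual',
     main='Residuals fan out')
abline(h=0, lty=2)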

Nested Random Effects

Let's move on to something more interesting. In many applications, data are overdispersed and naturally grouped into clusters, which calls for a nested mixed model. For ease of typing, we adopt a different notation: instead of writing $\bs{x}_i^{(j)}$, as in the previous post, we use $\bs{x}_{ij}$ to denote the covariates for the $j$th training example in the $i$th subject. In general, a generalized linear mixed model with nested random effects has the following linear predictor:

$$ \begin{aligned} g(\mu_{ij}) = \eta_{ij} &= \bs{x}_{ij}\tr\bs{\theta} + \bs{z}_{ij}\tr\bs{\beta}_i + \bs{t}_{ij}\tr\bs{\gamma}_{ij} \\ i &= 1,...,n, \,\, j=1,..., m_i \end{aligned}\tag{2} $$

where $\bs{\theta}$ is the fixed effect, $\bs{\beta}_i$ is the random effect associated with each subject through the covariates $\bs{z}_{ij}$, and $\bs{\gamma}_{ij}$ is the nested random effect associated with the $j$th training example in the $i$th subject through the covariates $\bs{t}_{ij}$.
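To make the structure concrete, here is a minimal simulation sketch for (2) with a Poisson response, a random intercept at the subject level ($\bs{z}_{ij}=1$) and a nested random intercept at the observation level ($\bs{t}_{ij}=1$). The helper name `nested.load`, the dimensions, and the variance values are illustrative choices, not part of the model above.

In [ ]:
nested.load <- function(n=10, m=20, theta=c(0.1, 0.6),
                        sigma1=0.5, sigma2=0.2, seed=2) {
    # -----------
    # n      = number of subjects
    # m      = observations per subject (m_i = m for all i)
    # theta  = fixed effect
    # sigma1 = sd of the subject-level random intercept beta_i
    # sigma2 = sd of the nested random intercept gamma_ij
    # -----------
    set.seed(seed)
    subject <- rep(1:n, each=m)
    x <- runif(n*m, 0, 5)
    beta <- rnorm(n, 0, sigma1)        # beta_i, one per subject
    gamma <- rnorm(n*m, 0, sigma2)     # gamma_ij, one per observation
    eta <- theta[1] + theta[2]*x + beta[subject] + gamma
    data.frame(subject=subject, x=x, y=rpois(n*m, exp(eta)))
}
nested.data <- nested.load()
head(nested.data)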

The model is complete after imbuing it with the following prior distributions:

$$ \begin{aligned} \bs{\beta}_i \sim \mathcal{N}(\bs{0}, \bs{\Sigma}_1), &\quad \bs{\Sigma}_1 \sim \text{IW}(\nu_1, \bs{S}_1)\\ \bs{\gamma}_{ij} \sim \mathcal{N}(\bs{0}, \bs{\Sigma}_2) , &\quad \bs{\Sigma}_2 \sim \text{IW}(\nu_2, \bs{S}_2), \end{aligned} \tag{3} $$

where $\text{IW}(\nu, \bs{S})$ is the inverse Wishart distribution for $\bs{\Sigma}$ with the kernel $|\bs{\Sigma}|^{-\nu/2} \text{exp}(-\text{tr}(\bs{\Sigma}^{-1}\bs{S})/2)$.

Adding the prior $\mathcal{N}(\bs{a}, \bs{R})$ for the fixed effect, we have the following posterior distribution:

$$ \begin{aligned} &\quad \pi(\bs{\theta}, \bs{\beta}_1,..., \bs{\beta}_n, \bs{\gamma}_{11},...,\bs{\gamma}_{n,m_n}, \bs{\Sigma}_1, \bs{\Sigma}_2) \propto \\ & \text{exp} \left\{-\frac{1}{2}(\bs{\theta}-\bs{a})\tr \bs{R}^{-1}(\bs{\theta}-\bs{a}) + \sum_{i=1}^n \sum_{j=1}^{m_i} \frac{y_{ij}\theta_{ij} - b(\theta_{ij})}{\phi_{ij}}\right\} \\ \times\,\, & |\bs{\Sigma}_1|^{-n/2} \text{exp}\left\{-\frac{1}{2}\sum_{i=1}^n \bs{\beta}_i\tr \bs{\Sigma}_1^{-1}\bs{\beta}_i\right\} \times |\bs{\Sigma}_2|^{-\sum_i m_i/2} \text{exp}\left\{-\frac{1}{2}\sum_{i=1}^n \sum_{j=1}^{m_i} \bs{\gamma}_{ij}\tr \bs{\Sigma}_2^{-1}\bs{\gamma}_{ij}\right\} \\ \times\,\, & |\bs{\Sigma}_1|^{-\nu_1/2} \text{exp}\left\{-\frac{1}{2}\text{tr}(\bs{\Sigma}_1^{-1}\bs{S}_1) \right\} \times |\bs{\Sigma}_2|^{-\nu_2/2} \text{exp}\left\{-\frac{1}{2}\text{tr}(\bs{\Sigma}_2^{-1}\bs{S}_2) \right\}. \end{aligned} \tag{4} $$

In the previous post, we developed the augmentation technique for fitting a mixed model. We can do the same here by writing the augmented regression coefficients and design matrix as

$$ \begin{aligned} \bs{\beta}_{(a)} &= (\bs{\theta}\tr, \bs{\beta}_1\tr,...,\bs{\beta}_n\tr, \bs{\gamma}_{11}\tr,...,\bs{\gamma}_{1,m_1}\tr,...,\bs{\gamma}_{n,m_n}\tr)' \\ \bs{X}_{(a)} &= \left(\bs{X} \given \text{diag}(\bs{Z}_1,..., \bs{Z}_n) \given \text{diag}(\bs{t}_{11}\tr,...,\bs{t}_{n,m_n}\tr) \right), \end{aligned} $$

where $\bs{X}=(\bs{x}_{11},...,\bs{x}_{n,m_n})'$ is the design matrix for the fixed effect, and $\bs{Z}_i = (\bs{z}_{i1},...,\bs{z}_{i,m_i})'$ is the design matrix for the $i$th subject's random effect.
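As a sketch, $\bs{X}_{(a)}$ can be assembled for the simulated `nested.data` above with `Matrix::bdiag`; with random intercepts only, each $\bs{Z}_i$ is a column of ones and each $\bs{t}_{ij}\tr$ is the scalar 1, so the last block reduces to an identity matrix. This is one illustrative construction, not the only one.

In [ ]:
library(Matrix)
n <- length(unique(nested.data$subject))
m <- as.vector(table(nested.data$subject))            # m_i for each subject
X <- cbind(1, nested.data$x)                          # fixed-effect design matrix
Z <- bdiag(lapply(m, function(mi) matrix(1, mi, 1)))  # diag(Z_1, ..., Z_n)
T.mat <- Diagonal(sum(m))                             # diag(t_11', ..., t_{n,m_n}')
X.a <- cbind2(cbind2(Matrix(X, sparse=TRUE), Z), T.mat)  # augmented design matrix
dim(X.a)                                              # sum(m) rows, 2 + n + sum(m) columns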

So the model can be written as $\bs{\eta} = \bs{X}_{(a)} \bs{\beta}_{(a)}$ with the prior distributions given by

$$ \bs{\beta}_{(a)} \given \bs{\Sigma}_1, \bs{\Sigma}_2 \sim \mathcal{N}\left(\,\begin{pmatrix} \bs{a} \\ \bs{0} \\ \vdots \\ \bs{0} \end{pmatrix}, \quad \begin{pmatrix} \bs{R} & & & \\ & \bs{\Sigma}_1 & & \\ & & \ddots & \\ & & & \bs{\Sigma}_2 \end{pmatrix}\,\right), $$

where the block-diagonal covariance contains $\bs{R}$ once, $\bs{\Sigma}_1$ repeated $n$ times, and $\bs{\Sigma}_2$ repeated $\sum_i m_i$ times.

From this point, a frequentist estimation procedure such as IWLS can be used to estimate the augmented coefficients. However, the dimension of the parameter space grows with both the number of subjects and the number of observations, so for any reasonably large dataset we would need to invert very large matrices.

In the Bayesian paradigm, we can use the Metropolis-within-Gibbs updating scheme. The idea is to divide the parameter space into naturally correlated blocks and sample each block using either Gibbs sampling (if possible) or the Metropolis-Hastings algorithm.

Recall that the adjusted response variable $\tilde{y}_{ij}$ is given by

$$ \tilde{y}_{ij}^{(t-1)}=\eta_{ij}^{(t-1)}+(y_{ij}-\mu_{ij}^{(t-1)})g'(\mu_{ij}^{(t-1)}). \tag{5} $$
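For the Poisson log link, $g'(\mu)=1/\mu$ and the IWLS weight is $w_{ij}=\mu_{ij}$, so (5) takes a very simple form. A small helper (the name is hypothetical) that computes both:

In [ ]:
# Adjusted response and IWLS weights for the Poisson log link:
#   y.tilde = eta + (y - mu)/mu,   w = mu
adjust.poisson <- function(y, eta) {
    mu <- exp(eta)
    list(y.tilde = eta + (y - mu)/mu, w = mu)
}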

For the $\bs{\theta}$ block, the IWLS proposal can be used as before, except that the linear predictor now includes the offset $\bs{z}_{ij}\tr\bs{\beta}_i + \bs{t}_{ij}\tr\bs{\gamma}_{ij}$. The proposal density becomes $\mathcal{N}(\bs{m}^{(t)}, \bs{C}^{(t)})$, where

$$ \begin{aligned} \bs{m}^{(t)} &= (\bs{R}^{-1}+\bs{X}\tr\bs{W}\bs{X})^{-1} (\bs{R}^{-1}\bs{a} + \bs{X}\tr \bs{W} (\bs{\tilde{y}}-\bs{Z}\bs{B} - \bs{T}\bs{\Gamma})) \\ \bs{C}^{(t)} &= (\bs{R}^{-1} + \bs{X}\tr\bs{W}\bs{X})^{-1} \end{aligned} \tag{6} $$

where

$$ \begin{aligned} \bs{\tilde{y}} & =(\tilde{y}_{11},...,\tilde{y}_{n,m_n})' \\ \bs{Z} &= \text{diag}(\bs{Z}_1,...,\bs{Z}_n)\\ \bs{B} &= (\bs{\beta}_1\tr,...,\bs{\beta}_n\tr)' \\ \bs{T} &= \text{diag}(\bs{t}_{11}\tr,...,\bs{t}_{n,m_n}\tr) \\ \bs{\Gamma} &=(\bs{\gamma}_{11}\tr,..., \bs{\gamma}_{n,m_n}\tr)' \\ \bs{W} &= \text{diag}(w_{11},..., w_{n,m_n}) \\ \bs{X} &= (\bs{x}_{11},..., \bs{x}_{n, m_n})' \end{aligned} \tag{7} $$
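A sketch of one $\bs{\theta}$-block proposal draw from (6), assuming the adjusted response `y.tilde`, the weights `w`, and the current offsets `Zb` ($=\bs{Z}\bs{B}$) and `Tg` ($=\bs{T}\bs{\Gamma}$) have already been computed; the dense `diag(w)` is used only for readability.

In [ ]:
propose.theta <- function(X, y.tilde, w, Zb, Tg, a, R.inv) {
    W <- diag(w)                                       # weight matrix
    C <- solve(R.inv + t(X) %*% W %*% X)               # proposal covariance C^(t)
    m <- C %*% (R.inv %*% a + t(X) %*% W %*% (y.tilde - Zb - Tg))  # proposal mean m^(t)
    as.vector(m + t(chol(C)) %*% rnorm(length(m)))     # one draw from N(m, C)
}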

For the $\bs{\beta}_i$ block, the full conditional distribution becomes

$$ \pi(\bs{\beta}_i) \propto \text{exp}\left\{-\frac{1}{2}\bs{\beta}_i\tr \bs{\Sigma}_1^{-1}\bs{\beta}_i + \sum_{j=1}^{m_i} \frac{y_{ij}\theta_{ij} - b(\theta_{ij})}{\phi_{ij}}\right\}. \tag{8} $$
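In the Poisson case ($\theta_{ij}=\eta_{ij}$, $b(\theta)=e^{\theta}$, $\phi_{ij}=1$), the log of this full conditional, which is what enters the Metropolis-Hastings acceptance ratio, is easy to evaluate; a sketch with illustrative names, where `eta.i` is subject $i$'s linear predictor evaluated at the candidate $\bs{\beta}_i$:

In [ ]:
# log full conditional of beta_i, up to an additive constant (Poisson case)
log.fc.beta.i <- function(beta.i, y.i, eta.i, Sigma1.inv) {
    sum(y.i * eta.i - exp(eta.i)) -
        0.5 * drop(t(beta.i) %*% Sigma1.inv %*% beta.i)
}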

We can use the same Metropolis IWLS sampling scheme as we used for $\bs{\theta}$. The offset is now $\bs{x}_{ij}\tr \bs{\theta} + \bs{t}_{ij}\tr\bs{\gamma}_{ij}$. The proposal density is $\mathcal{N}(\bs{m}^{(t)}_i, \bs{C}^{(t)}_i)$, where

$$ \begin{aligned} \bs{m}_i^{(t)} &= (\bs{\Sigma}_1^{-1} + \bs{Z}_i\tr\bs{W}_i\bs{Z}_i)^{-1}\bs{Z}_i\tr\bs{W}_i(\bs{\tilde{y}}_i-\bs{X}_i\bs{\theta} - \bs{T}_i\bs{\Gamma}_i)\\ \bs{C}_i^{(t)} &= (\bs{\Sigma}_1^{-1} + \bs{Z}_i\tr\bs{W}_i \bs{Z}_i)^{-1} \end{aligned} $$

where

$$ \begin{aligned} \bs{\tilde{y}}_i & =(\tilde{y}_{i1},...,\tilde{y}_{i,m_i})' \\ \bs{Z}_i &= (\bs{z}_{i1},...,\bs{z}_{i, m_i})'\\ \bs{X}_i &= (\bs{x}_{i1},...,\bs{x}_{i,m_i})'\\ \bs{T}_i &= \text{diag}(\bs{t}_{i1}\tr,...,\bs{t}_{i, m_i}\tr) \\ \bs{\Gamma}_i &=(\bs{\gamma}_{i1}\tr,..., \bs{\gamma}_{i,m_i}\tr)' \\ \bs{W}_i &= \text{diag}(w_{i1},..., w_{i,m_i}) \end{aligned} \tag{9} $$
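A sketch of the corresponding draw from $\mathcal{N}(\bs{m}_i^{(t)}, \bs{C}_i^{(t)})$, with illustrative names and the prior mean $\bs{0}$ from (3):

In [ ]:
propose.beta.i <- function(Z.i, y.tilde.i, w.i, X.theta.i, T.gamma.i, Sigma1.inv) {
    W.i <- diag(w.i, nrow=length(w.i))
    C.i <- solve(Sigma1.inv + t(Z.i) %*% W.i %*% Z.i)                      # C_i^(t)
    m.i <- C.i %*% t(Z.i) %*% W.i %*% (y.tilde.i - X.theta.i - T.gamma.i)  # m_i^(t)
    as.vector(m.i + t(chol(C.i)) %*% rnorm(length(m.i)))                   # draw from N(m_i, C_i)
}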

For the $\bs{\gamma}_{ij}$ block, the full conditional distribution becomes

$$ \pi(\bs{\gamma}_{ij}) \propto \text{exp}\left\{-\frac{1}{2}\bs{\gamma}_{ij}\tr \bs{\Sigma}_2^{-1}\bs{\gamma}_{ij} + \frac{y_{ij}\theta_{ij} - b(\theta_{ij})}{\phi_{ij}}\right\}. \tag{10} $$

The offset is now $\bs{x}_{ij}\tr \bs{\theta} + \bs{z}_{ij}\tr\bs{\beta}_i$. The proposal density is $\mathcal{N}(\bs{m}^{(t)}_{ij}, \bs{C}^{(t)}_{ij})$, where

$$ \begin{aligned} \bs{m}_{ij}^{(t)} &= (\bs{\Sigma}_2^{-1} + \bs{t}_{ij}w_{ij}\bs{t}_{ij}\tr)^{-1}\bs{t}_{ij}w_{ij}(\tilde{y}_{ij}-\bs{x}_{ij}\tr\bs{\theta}-\bs{z}_{ij}\tr\bs{\beta}_i) \\ \bs{C}_{ij}^{(t)} &= (\bs{\Sigma}_2^{-1} + \bs{t}_{ij}w_{ij}\bs{t}_{ij}\tr)^{-1} \end{aligned}\tag{11} $$
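With a nested random intercept ($\bs{t}_{ij}=1$) and a scalar $\bs{\Sigma}_2$, (11) reduces to a one-dimensional computation per observation; a minimal sketch with illustrative names:

In [ ]:
propose.gamma.ij <- function(y.tilde.ij, w.ij, x.theta.ij, z.beta.ij, Sigma2.inv) {
    C.ij <- 1 / (Sigma2.inv + w.ij)     # t_ij = 1, so t w t' = w
    m.ij <- C.ij * w.ij * (y.tilde.ij - x.theta.ij - z.beta.ij)
    rnorm(1, m.ij, sqrt(C.ij))          # draw from N(m_ij, C_ij)
}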

The variance components have full conditional distributions given by

$$ \begin{aligned} \bs{\Sigma}_1 &\sim \text{IW}(\nu_1+n, \bs{S}_1 + \sum_i \bs{\beta}_i \bs{\beta}_i\tr) \\ \bs{\Sigma}_2 &\sim \text{IW}(\nu_2 + \sum_i m_i, \bs{S}_2 + \sum_{i,j}\bs{\gamma}_{ij} \bs{\gamma}_{ij}\tr). \end{aligned} \tag{12} $$
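These are exact Gibbs draws. Base R's `rWishart` samples the standard Wishart, and in the kernel parameterization used above ($|\bs{\Sigma}|^{-\nu/2}\text{exp}(-\text{tr}(\bs{\Sigma}^{-1}\bs{S})/2)$) an inverse Wishart draw can be obtained by inverting a Wishart draw with $\nu - q - 1$ degrees of freedom and scale $\bs{S}^{-1}$, where $q$ is the dimension; if you use a different IW convention, adjust accordingly. The lists `beta.list` and `gamma.list` in the commented lines are assumed to hold the current draws.

In [ ]:
riw <- function(nu, S) {
    # Draw Sigma with kernel |Sigma|^{-nu/2} exp(-tr(Sigma^{-1} S)/2):
    # if W ~ Wishart(df = nu - q - 1, scale = S^{-1}), then W^{-1} has this kernel.
    q <- nrow(S)
    solve(rWishart(1, df = nu - q - 1, Sigma = solve(S))[,,1])
}
# Full conditional draws from (12):
# Sigma1 <- riw(nu1 + n,      S1 + Reduce(`+`, lapply(beta.list,  tcrossprod)))
# Sigma2 <- riw(nu2 + sum(m), S2 + Reduce(`+`, lapply(gamma.list, tcrossprod)))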