Mar. 11, 2019 $\newcommand{\bs}{\boldsymbol}$ $\newcommand{\argmin}[1]{\underset{\bs{#1}}{\text{arg min}}\,}$ $\newcommand{\argmax}[1]{\underset{\bs{#1}}{\text{arg max}}\,}$ $\newcommand{\tr}{^{\top}}$ $\newcommand{\norm}[1]{\left|\left|\,#1\,\right|\right|}$ $\newcommand{\given}{\,|\,}$ $\newcommand{\st}{\,\big|\,}$ $\newcommand{\E}[1]{\mathbb{E}\left[#1\right]}$ $\newcommand{\P}[1]{\mathbb{P}\left(#1\right)}$ $\newcommand{\abs}[1]{\left|#1\right|}$ $\newcommand{\blue}[1]{{\color{blue} {#1}}}$ $\newcommand{\red}[1]{{\color{red} {#1}}}$ $\newcommand{\orange}[1]{{\color{orange} {#1}}}$ $\newcommand{\pfrac}[2]{\frac{\partial #1}{\partial #2}}$

Linear regression is a beautiful subject. Nowadays it seems that classical methods such as linear regression is attracting less attention. But in reality, linear methods is the fundamental building block for a majority of supervised learning techniques, so it will always be useful!

Machine learning is primarily concerned with functional approximation and estimation of the form

$$ y_i = f(x_i) + \epsilon_i, \tag{1} $$

and our goal is to estimate $f$ for all $x$ in the domain based on the data at hand. For example, a single layer feedforward neural network for regression can be viewed as linear basis expansions of nonlinear transformation of the input vector:

$$ \hat{f}(x;\theta) = \sum_{i=1}^K h_i(x)\,\theta_i, $$ where $h_i = \sigma(x\tr\beta_i)$, and $\sigma(\cdot)$ is the sigmoid function.

We want to estimate the parameters $\theta_i$. The most popular method is using the maximum likelihood principle. Assume that data $D = \{(x_1, y_1),...,(x_n, y_n)\}$ are iid, then the empirical distribution $\hat{p}(x \given y)$ is given by

$$ \hat{p} = \frac{1}{n}\, I((x,y)\in D). $$

We choose $\hat{\theta}$ that minimizes the KL divergence between the empirical distribution and the modeled distribution $p_{\theta}(y \given x)$, given by

$$ \mathbb{E}_{\hat{p}} \log\frac{\hat{p}}{p} = -\log n - \frac{1}{n}\sum_{i=1}^n \log p_{\theta}(y_i \given x_i) $$

If we assume that the modeled conditional likelihood is

$$ p_{\theta}(y\given x) = N(f_\theta(x), \sigma^2) = \frac{1}{\sqrt{2\pi}\sigma} \text{exp}\left(-\frac{(y-f_{\theta}(x))^2}{2\sigma^2} \right), $$

then the KL divergence becomes

$$ D_{KL} = -\log n + \frac{1}{2}\log(2\pi) + \log\sigma + \frac{1}{2n\sigma^2} \sum_{i=1}^n(y_i-f_{\theta}(x_i))^2, $$

so we see that minimizing the KL divergence becomes the same as minimizing the `residual sum of squares (RSS)`

.

For classification problems, assume that $Y$ follows a generalized Bernoulli distribution taking values in $\{c_1,...,c_K\}$ with probabilities given by $\pi_{\theta,c_k}(x)$. Then the modeled conditional likelihood becomes

$$ p_{\theta}(y\given x) = \prod_{k=1}^K \pi_{\theta,c_k}(x)^{I(y=c_k)}. $$

Hence, the KL divergence becomes

$$ \begin{aligned} D_{KL} &= -\log n - \frac{1}{n} \sum_{i=1}^n \sum_{k=1}^K\log \pi_{\theta,c_k}(x_i) \,I(y_i=c_k) \\ &= -\log n - \frac{1}{n} \sum_{i=1}^n \log \pi_{\theta,y_i}(x_i). \end{aligned} $$

In practice, we often use the `softmax function`

to model the class probabilities. Let $\theta=(\beta(c_1), \beta(c_2),..., \beta(c_k))$, where $\beta(c_k)$ denotes a set of parameters for class $c_k$. Then, the class probabilities can be written as

$$ \pi_{\theta, c_k}(x) = \frac{\text{exp}(\beta(c_k)\tr x)}{\sum_{j=1}^K \text{exp}(\beta(c_j)\tr x)}. $$

This implies the following loss function for the softmax regression:

$$ J(\theta) = -\sum_{i=1}^n \log \frac{\text{exp}(\beta(y_i)\tr x_i)}{\sum_{j=1}^K \text{exp}(\beta(c_j)\tr x_i)} \tag{2} $$

Note that as a special case when $K=2, c_1=0, c_2=1$, we have that

$$ \begin{aligned} \pi_{\theta,0}(x) &= \frac{\text{exp}(\beta(0)\tr x)}{\text{exp}(\beta(0)\tr x) + \text{exp}(\beta(1)\tr x)} \\ &= \frac{1}{1+\text{exp}(\beta\tr x)} \end{aligned} $$

where $\beta = \beta(1)-\beta(0)$. Similarly we have that

$$ \pi_{\theta,1}(x) = 1-\pi_{\theta,0}(x) = \frac{1}{1+\text{exp}(-\beta\tr x)} = \sigma(\beta\tr x). $$

This leads to the well-known cross-entropy loss function for the `logistic regression`

:

$$ J(\theta) = -\sum_{i=1}^n \left\{(1-y_i)\log (1-\sigma(\beta\tr x)) + y_i \log \sigma(\beta\tr x)\right\} \tag{3} $$

In general, to find the gradients of $J(\theta)$, consider a single parameter $\beta_l(c_k)$ which denotes the $l$th element in the parameter $\beta(c_k)$. Taking the partial derivative, we have

$$ \begin{aligned} \pfrac{J(\theta)}{\beta_l(c_k)} &= -\sum_{i=1}^n \left\{\frac{\sum_{j=1}^K \text{exp}\left(\beta(c_j)\tr x_i\right)}{\text{exp}\left(\beta(y_i)\tr x_i\right)} \cdot \pfrac{\pi_{\theta, y_i}(x_i)}{\beta_l(c_k)}\right\} \\ &= -\sum_{i=1}^n \left\{\frac{S_i}{\text{exp}\left(\beta(y_i)\tr x_i\right)} \cdot \pfrac{\pi_{\theta, y_i}(x_i)}{\beta_l(c_k)}\right\}, \end{aligned} \tag{4} $$

where $S_i=\sum_{j=1}^K \text{exp}\left(\beta(c_j)\tr x_i\right)$ for ease of notation. The inner partial derivative can be obtained similarly using the chain rule:

$$ \pfrac{\pi_{\theta,y_i}(x_i)}{\beta_l(c_k)} = \frac{\text{exp}(\beta(y_i)\tr x_i)\cdot I(y_i=c_k)\cdot x_{il}\cdot S_i - \text{exp}(\beta(c_k)\tr x_i)\cdot x_{il}\cdot \text{exp}(\beta(y_i)\tr x_i)}{S_i^2}. \tag{5} $$

Substitute (5) into (4) and simplify yield

$$ \pfrac{J(\theta)}{\beta_l(c_k)} = -\sum_{i=1}^n x_{il}\left(I(y_i=c_k) - \pi_{\theta,c_k}(x_i)\right). $$

This can be easily vectorized as

$$ \nabla_{\beta(c_k)} J(\theta) = -\sum_{i=1}^n x_i \left(I(y_i=c_k) - \pi_{\theta, c_k}(x_i)\right). \tag{6} $$

Equation (3) enables easy implementation of the gradient descent optimization.

Now suppose that we want to find a $\hat{f}$ that minimizes of RSS of $f$:

$$ RSS(f) = \sum_{i=1}^n (y_i-f(x_i))^2. \tag{7} $$

This is a challenging task. First of all, there are infinitely many solutions to (7). Secondly, the estimation is entirely dependent on data. In order to obtain a unique solution, we must make assumptions that help us choose a set of functions from which to optimize. Often this is done outside of the data; hence, in a sense, model specification is analogous to specifying a prior distribution in the Bayesian paradigm. For instance, on the far end of non-parametric methods, we have the `K-nearest neighbors`

. Choosing this method also implicitly demonstrates our prior belief (or assumption) that $\widehat{f}$ exhibits homogeneous local behavior. Locally weighted least squares assumes that $\widehat{f}$ is locally linear. In the following sections we start by exploring linear methods for regression.

Linear regression is an optimilzation problem whose objective function is given by

$$ RSS(\beta) = (\bs{y}-\bs{X}\beta)\tr(\bs{y}-\bs{X}\beta), \tag{8} $$

where $\bs{y}=(y_1,...,y_n)'$ is the response vector, $\bs{X}=(\bs{1},\bs{x}_1,...,\bs{x}_p)$ is the $n\times(p+1)$ design matrix, and $\beta=(\beta_0,\beta_1,...,\beta_p)'$ is the vector of regression coefficients.

This is quadratic function, so we know that the optimal value exists and is unique. Differentiating with respect to $\beta$ and setting it equal to 0 gives

$$ \pfrac{RSS}{\beta} = -2\bs{X}\tr(\bs{y}-\bs{X}\beta) \overset{\text{set}}{=} 0. $$

Suppose $\widehat{\beta}$ solves the equation above. Then we must have that

$$ \bs{X}\tr(\bs{y}-\bs{X}\widehat{\beta}) = 0. \tag{10} $$

Observation 1The residual vector $\bs{R}=\bs{y}-\bs{X}\widehat{\beta}$ is orthogonal to the column space of $\bs{X}\tr$ spanned by the covariate vectors $\bs{1}, \bs{x}_1,..., \bs{x}_p$. This shows that the sum of the residuals is equal to 0.

From (10) we obtain

$$ \hat{\beta}=(\bs{X}\tr \bs{X})^{-1}\bs{X}\tr \bs{y}. \tag{11} $$

Multiplying $\hat{\beta}$ by $\bs{X}$ we obtain the predicted values

$$ \bs{\hat{y}} = \bs{X}\hat{\beta} = \bs{X}(\bs{X}\tr\bs{X})^{-1}\bs{X}\tr \bs{y} = \bs{H}\bs{y}. \tag{12} $$

Observation 2The matrix $\bs{H}$ is the`projection matrix`

that projects the response vector $\bs{y}$ onto the column space of $\bs{X}$.

Suppose we are fitting a constant model $Y = \beta_0 + \epsilon$. Then substituting $\bs{X} = \bs{1}$ into (11) yields

$$ \hat{\beta_0} = \overline{y}. $$

This shows that the sample mean is the least squares estimate for a constant model.

Suppose now that all columns of $\bs{X}$ are orthogonal; that is $\left<\bs{x}_i, \bs{x}_j\right>=0$ for all $i\neq j$. Then we can write $\bs{X} = QD$, where $Q$ is an orthogonal matrix such that $Q\tr Q=I$ and $D$ is a diagonal matrix such that $D_{ii} = ||\bs{x}_i||$. Then substituting this into $(11)$ yields

$$ \begin{aligned} \hat{\beta} &= ((QD)\tr QD)^{-1}(QD)\tr \bs{y} \\ &= (D Q\tr QD)^{-1} D Q\tr \bs{y} \\ &= D^{-2} D Q\tr \bs{y} \\ &= D^{-2}\bs{X}\tr \bs{y} \end{aligned} $$

This shows that projection onto a set of orthogonal vectors can be done elementwise:

$$ \hat{\beta}_j = \frac{\left<\bs{y}, \bs{x}_j\right>}{\left<\bs{x}_j, \bs{x}_j\right>}. \tag{13} $$

This suggests the `Gram-Schmidt orthogonalization process`

:

Algorithm 1Consider $Y=\bs{X}\beta+\bs{\epsilon}$.

- Initialize $\bs{z}_0 = \bs{1}$.
- $\text{for } j=1,2,...,p$

- Project $\bs{x}_j$ onto $\bs{z}_k$ to obtain coefficients $$\gamma_{kj} = \frac{\left<\bs{x}_j, \bs{z}_k\right>}{\left<\bs{z}_k, \bs{z}_k\right>}, \quad \text{ for } k=0,1,...,j-1.$$
- Compute the residual $$\bs{z}_j = \bs{x}_j-\sum_{k=0}^{j-1} \gamma_{kj}\bs{z}_k.\tag{14}$$
- Project $\bs{y}$ onto the last residual $\bs{z}_p$ to obtain $\hat{\beta}_p$; that is $$\hat{\beta}_p = \frac{\left<\bs{y}, \bs{z}_p\right>}{\left<\bs{z}_p, \bs{z}_p\right>}. \tag{15}$$

Rearranging equation (14), we see that $\bs{x}_j$ is in the span of the orthogonal basis $\bs{z}_0,...,\bs{z}_j$. This shows that $\bs{z}_j$'s also span the column space; hence the projection is the same up to a change of basis. Moreover, we see that the vector $\bs{z}_p$ is the only vector (among the $\bs{z}_j$'s) that involves the vector $\bs{x}_p$, with coefficient 1. Hence, the coefficient obtained by projecting $\bs{y}$ onto $\bs{z}_p$ must be the same as $\hat{\beta}_p$.

Observation 3The coefficient $\hat{\beta}_p$ represents the additional contribution of $p$th covariate $\bs{x}_p$ on the response $\bs{y}$, after removing its correlation with the other covariates.

This is a beautiful observation. Ideally, we want the covariates to be linearly independent, making the regression coefficient exactly equal to its contribution to the response. If $\bs{x}_p$ is highly correlated with the other covariates, then the residual $\bs{z}_p$ will be small in magnitude. This leads to a high variance in the estimate for the coefficient, since by taking the variance on (15) we have

$$ \text{Var}(\hat{\beta}_p) = \frac{\text{Var}\left<\bs{y}, \bs{z}_p\right>}{\left<\bs{z}_P, \bs{z}_p\right>^2} = \frac{\sigma^2}{\left<\bs{z}_p, \bs{z}_p\right>}, $$

which becomes large if $||\bs{z}_p|| = \sqrt{\left<\bs{z}_p, \bs{z}_p\right>}$ is small.

If we set $\gamma_{jj}=1$, equation (14) can be written as

$$ \bs{x}_j = \bs{z}_j + \sum_{k=0}^{j-1} \gamma_{kj}\bs{z}_k = \sum_{k=0}^j \gamma_{kj}\bs{z}_k, $$

which can be cleverly expressed in the matrix form

$$ \bs{X} = Z\Gamma := \begin{bmatrix} . & . & \cdots & . \\ \bs{z}_0 & \bs{z}_1 & \cdots & \bs{z}_p \\ . & . & \cdots & . \end{bmatrix}\begin{bmatrix} \gamma_{00} & \gamma_{01} & \gamma_{02} & \cdots & \gamma_{0p} \\ 0 & \gamma_{11} & \gamma_{12} & \cdots & \gamma_{1p} \\ 0 & 0 & \gamma_{22} & \cdots & \gamma_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & \gamma_{pp} \end{bmatrix} $$

If we let $Z=QD$ where $Q$ is an orthogonal matrix, and $D$ is the diagonal matrix such that $D_{ii} = ||\bs{z}_i||$, then we have that

$$ \bs{X} = ZD^{-1}D\Gamma = QR. $$

This is the QR-factorization of $\bs{X}$. Using the new representation, the least squares solution can be written as

$$ \hat{\beta} = R^{-1} Q\tr \bs{y} \tag{16} $$

The final question we want to ask is:

How do we uncover all the coefficients of $\hat{\beta}$ from a single run of the Gram Schmidt process?

Evaluating (16) is equivalent to solving the following upper triangular system

$$ R\beta = Q\tr \bs{y} =: \bs{b}. \tag{17} $$

Let $Q=(\bs{q}_1,...,\bs{q}_p)$, where $\bs{q}_i = \bs{z}_i/||\bs{z}_i||$ are columns of $Q$ obtained by normalizing the columns of $Z$. Let $R=(r_{ij}) = D\Gamma$, so that $r_{ij} = \gamma_{ij} ||\bs{z}_i||$. From (13), we can clearly see that $\bs{b}$ is the projection of $\bs{y}$ onto the columns of $\bs{Q}$, which have unit length; that is,

$$ b_j = \left<\bs{y}, \bs{q}_j\right>. $$

Using back substitution, we can uncover the components of $\hat{\beta}$ as follows. Starting from the bottom row, we have that $\hat{\beta}_p = b_p / r_{pp}$. The next equation above gives

$$ \begin{aligned} & r_{p-1,p-1} \hat{\beta}_{p-1} + r_{p-1, p} \hat{\beta}_p = b_{p-1} \\ \implies & \hat{\beta}_{p-1} = \frac{b_{p-1} - r_{p-1,p}\hat{\beta}_p}{r_{p-1,p-1}}. \end{aligned} $$

A pattern emerges. For $j=1,2,...,p$,

$$ \hat{\beta}_{p-j} = \frac{b_{p-j} - \sum_{k=1}^j r_{p-k,p-j+1}\hat{\beta}_{p-j+1}}{r_{b-j,b-j}}. $$

We conclude this section by presenting the complete Gram Schmidt altorighm for finding all components of $\hat{\beta}$!

Algorithm 2Consider $Y=\bs{X}\beta+\bs{\epsilon}$.

- Initialize $\bs{z}_0 = \bs{1},\, \bs{q}_0 = \bs{1}/\sqrt{n},\, b_0 = \left<\bs{y}, \bs{q}_0\right>$.
- $\text{for } j=1,2,...,p$

- Project $\bs{x}_j$ onto $\bs{z}_k$ to obtain coefficients $$\gamma_{kj} = \frac{\left<\bs{x}_j, \bs{z}_k\right>}{\left<\bs{z}_k, \bs{z}_k\right>}, \quad r_{kj} = \gamma_{kj} ||\bs{z}_k||, \quad \text{ for } k=0,1,...,j-1.$$
- Compute the residual $$\bs{z}_j = \bs{x}_j-\sum_{k=0}^{j-1} \gamma_{kj}\bs{z}_k. $$
- Normalize $\bs{z}_j$ to obtain the orthonormal basis $$\bs{q}_j = \frac{\bs{z}_j}{||\bs{z}_j||}.$$
- Compute the projection of $\bs{y}$ onto the orthonormal basis $$b_j = \left<\bs{y}, \bs{q}_j\right>.$$
- For $j=0,1,...,p$, set $r_{jj} = ||z_j||$.
- Initialize $\hat{\beta}_p = b_p/r_{pp}$.
- $\text{for } j=1,...,p$, uncover all the other coefficients by $$ \hat{\beta}_{p-j} = \frac{b_{p-j} - \sum_{k=1}^j r_{p-k,p-j+1}\hat{\beta}_{p-j+1}}{r_{b-j,b-j}}. $$