Penalized Regression

Mar. 31, 2019 $\newcommand{\bs}{\boldsymbol}$ $\newcommand{\argmin}[1]{\underset{\bs{#1}}{\text{arg min}}\,}$ $\newcommand{\argmax}[1]{\underset{\bs{#1}}{\text{arg max}}\,}$ $\newcommand{\tr}{^{\top}}$ $\newcommand{\norm}[1]{\left|\left|\,#1\,\right|\right|}$ $\newcommand{\given}{\,|\,}$ $\newcommand{\st}{\,\big|\,}$ $\newcommand{\E}[1]{\mathbb{E}\left[#1\right]}$ $\newcommand{\P}[1]{\mathbb{P}\left(#1\right)}$ $\newcommand{\abs}[1]{\left|#1\right|}$ $\newcommand{\blue}[1]{{\color{blue} {#1}}}$ $\newcommand{\red}[1]{{\color{red} {#1}}}$ $\newcommand{\orange}[1]{{\color{orange} {#1}}}$ $\newcommand{\pfrac}[2]{\frac{\partial #1}{\partial #2}}$

Recall that for the regression model $y=X\beta +\epsilon$, where

  • $X$ is the $n\times p$ design matrix with rank $p$,
  • $\E{\epsilon \given X} = 0$,
  • $\text{Var}\left(\epsilon \given X\right) = \sigma^2 I$,

the OLS estimator of the regression coefficients is given by

$$ \begin{aligned} \hat{\beta}_{OLS} &= (X\tr X)^{-1}X\tr y \\ &= (X\tr X)^{-1}X\tr (X\beta + \epsilon) \\ &= \beta + (X\tr X)^{-1}X\tr \epsilon. \end{aligned}\tag{1} $$

By the law of iterated expectation, the expected value of $\hat{\beta}_{OLS}$ is

$$ \begin{aligned} \E{\hat{\beta}_{OLS}} &= \E{\E{\hat{\beta}_{OLS} \given X}} \\ &= \E{\E{\beta + (X\tr X)^{-1} X\tr \epsilon \given X}} \\ &= \E{\beta + 0} \\ &= \beta. \end{aligned} $$

This shows that $\hat{\beta}_{OLS}$ is a linear unbiased estimator. Furthermore, by the Gauss-Markov theorem (a result going back to Gauss), $\hat{\beta}_{OLS}$ is also the best linear unbiased estimator (BLUE). This means that for any other linear unbiased estimator $\hat{\beta}$, the difference of the two covariance matrices

$$ \Sigma - \Sigma_{OLS} := \text{Var}\left(\hat{\beta} \given X\right) - \text{Var}\left(\hat{\beta}_{OLS} \given X\right) \tag{2} $$

is positive semi-definite. Equivalently, for any vector $z \in \mathbb{R}^p$,

$$ \begin{aligned} & z\tr (\Sigma - \Sigma_{OLS}) z \geq 0 \\ \implies & z\tr \text{Var}\left(\hat{\beta}_{OLS}\given X\right) z \leq z\tr \text{Var}\left(\hat{\beta} \given X\right)z \\ \implies & \text{var}\left(z\tr \hat{\beta}_{OLS} \given X\right) \leq \text{var}\left(z\tr \hat{\beta}\given X \right). \end{aligned} \tag{3} $$

Note here var denotes the variance while Var denotes the covariance matrix. We also used the fact that $\text{var}(z\tr x) = z\tr\text{Var}(x)\,z$ for a random vector $x$ and a constant vector $z$. This is because

$$ \begin{aligned} \text{var}(z\tr x) &= \E{\left(z\tr x - \E{z\tr x}\right)^2} \\ &= \E{\left(z\tr x - z\tr\E{x}\right)^2} \\ &= \E{\left(z\tr \left(x - \E{x}\right)\right)^2} \\ &= \E{z\tr \left(x - \E{x}\right) \left(x - \E{x}\right)\tr z} \\ &= z\tr\E{\left(x - \E{x}\right) \left(x - \E{x}\right)\tr}z \\ &= z\tr \text{Var}(x)\, z. \end{aligned} $$

Equation (3) means that for a fixed $z$, the variance of the linear combination $z\tr\hat{\beta}_{OLS}$ is the smallest among all $z\tr \hat{\beta}$ for any other linear unbiased estimator $\hat{\beta}$. This automatically implies that

Remark 1   Each component of $\hat{\beta}_{OLS}$ has the smallest variance ($\left[\Sigma_{OLS}\right]_{ii}$) among the same component of all linear unbiased estimators of $\beta$.

The covariance matrix for the OLS estimator can be easily derived:

$$ \begin{aligned} \Sigma_{OLS} &= \text{Var}\left(\hat{\beta}_{OLS} \given X\right) \\ &= \text{Var}\left((X\tr X)^{-1}X\tr \epsilon \given X\right) \\ &= \left[(X\tr X)^{-1} X\tr\right] \text{Var}(\epsilon \given X) \left[(X\tr X)^{-1} X\tr\right]\tr \\ &= \left[(X\tr X)^{-1} X\tr\right] \left(\sigma^2 I\right) \left[X(X\tr X)^{-1}\right] \\ &= \sigma^2\left(X\tr X\right)^{-1}. \end{aligned}\tag{4} $$

As a corollary, the mean squared error (MSE) of $\hat{\beta}_{OLS}$ is given by

$$ \begin{aligned} \text{MSE}(\hat{\beta}_{OLS}) &= \E{(\hat{\beta}_{OLS} - \beta)\tr (\hat{\beta}_{OLS} - \beta)} \\ &= \sum_{i} \text{var}\left(\left[\hat{\beta}_{OLS}\right]_{i}\right) \\ &=\sum_{i} \left[\Sigma_{OLS}\right]_{ii} \\ &= \sigma^2 \text{tr}\left[(X\tr X)^{-1}\right]. \end{aligned} \tag{5} $$
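
As a quick sanity check of (5), here is a minimal simulation sketch (the design, coefficients, and noise level below are arbitrary choices made up for illustration). The two printed numbers should agree up to Monte Carlo error.

import numpy as np

np.random.seed(0)
n, p, sigma = 100, 5, 2.0
X = np.random.randn(n, p)            # a fixed, arbitrary design
beta = np.random.randn(p)            # arbitrary true coefficients

n_rep = 5000
mse = 0.0
for _ in range(n_rep):
    y = X @ beta + sigma * np.random.randn(n)
    beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
    mse += np.sum((beta_hat - beta) ** 2) / n_rep

print("simulated MSE:            ", mse)
print("sigma^2 * tr((X'X)^{-1}): ", sigma**2 * np.trace(np.linalg.inv(X.T @ X)))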

By Remark 1, the MSE of the OLS estimator is also the smallest among all other linear unbiased estimators. Does this mean that the OLS is the best among all estimators? Certainly not if we remove the restriction of unbiasedness! Let $Z$ denote the random input vector and $Y$ the corresponding response variable. As before, we assume the model

$$ Y=f(Z)+ \epsilon, \quad \E{\epsilon}=0, \,\, \text{var}(\epsilon)=\sigma^2. $$

Our goal is to find a good estimator $\hat{f}(\cdot)$ that performs well not only on the training data, but also with respect to the entire data-generating distribution.

Thus, we define the prediction error (PE) at a point $z_0$ as

$$ \begin{aligned} \text{PE}(z_0) &= \mathbb{E}_{Y\given Z=z_0}\left[(Y-\hat{f}(Z))^2 \given Z=z_0\right]\\ &= \E{\left(Y- f(Z) + f(Z) - \E{\hat{f}(Z)} + \E{\hat{f}(Z)} - \hat{f}(Z)\right)^2 \given Z=z_0} \\ &= \E{\left(Y-f(Z)\right)^2 + \left(f(Z) - \E{\hat{f}(Z)}\right)^2 + \left(\hat{f}(Z) - \E{\hat{f}(Z)}\right)^2 \given Z=z_0} \\ &= \E{\left(Y-f(z_0)\right)^2} + \left(f(z_0) - \E{\hat{f}(z_0)}\right)^2 + \E{\left(\hat{f}(z_0) - \E{\hat{f}(z_0)}\right)^2} \\ &= \sigma^2 + \text{bias}^2(\hat{f}(z_0)) + \text{var}(\hat{f}(z_0)), \end{aligned} \tag{6} $$

where the expectation is taken with respect to the conditional distribution $Y\given Z=z_0$ (together with the randomness in the training sample used to construct $\hat{f}$), and the third equality holds because all the cross terms in the quadratic expansion have expected value 0 and are therefore omitted. This is also known as the bias-variance tradeoff. For the OLS estimator, the bias is 0 because

$$ \begin{aligned} \E{\hat{f}(z_0)} &= \E{z_0\tr \hat{\beta}_{OLS}} \\ &= z_0\tr \E{\hat{\beta}_{OLS}} \\ &= z_0\tr \beta \\ &= f(z_0). \end{aligned} $$

This suggests the possibility that introducing a little bias in our estimate of $\beta$ might lead to a large reduction in variance, and thus could substantially lower the prediction error!
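
Here is a small simulation sketch of that idea (all sizes and the penalty value are arbitrary choices, picked only for illustration). A slightly biased estimator of the form $(X\tr X+\lambda I)^{-1}X\tr y$, which is derived in the next section, typically achieves a lower prediction error than OLS when the number of predictors is close to the sample size.

import numpy as np

np.random.seed(1)
n, p, sigma = 40, 30, 1.0            # p close to n makes OLS highly variable
lam = 1.0                            # a modest penalty, chosen only for illustration
X = np.random.randn(n, p)
beta = np.random.randn(p)
X_test = np.random.randn(1000, p)
f_test = X_test @ beta               # noiseless test signal

n_rep = 500
pe_ols = pe_ridge = 0.0
for _ in range(n_rep):
    y = X @ beta + sigma * np.random.randn(n)
    b_ols = np.linalg.solve(X.T @ X, X.T @ y)
    b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
    pe_ols += np.mean((f_test - X_test @ b_ols) ** 2) / n_rep
    pe_ridge += np.mean((f_test - X_test @ b_ridge) ** 2) / n_rep

# the irreducible sigma^2 would be added to both, so only bias^2 + variance is reported
print("excess prediction error, OLS:  ", pe_ols)
print("excess prediction error, ridge:", pe_ridge)

In this setting the penalized estimator usually reports a noticeably smaller excess prediction error, despite being biased.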

Bayes Estimator

We take an unconventional path. First assume a multivariate normal model for the response vector $y=(y_1,...,y_n)'$:

$$ y \given X, \beta \sim \mathcal{N}\left(\bs{1}\beta_0 + X_*\beta_*, \sigma^2I\right), $$

where $X_*=[x_1,...,x_p]$ is the design matrix without the intercept column, and $\beta_*=(\beta_1,...,\beta_p)'$ is the coefficient vector without the intercept term, and $\beta_0$ is the intercept.

Furthermore we assume that $\beta_*$ has a normal prior:

$$ \beta_* \sim \mathcal{N}\left(0, \tau^2 I\right). $$

Here $\sigma^2$ and $\tau^2$ are treated as fixed and known, so they do not require prior specifications. If we assume an uninformative (flat) prior for $\beta_0$, the posterior distribution of $\beta$ given the data $(X, y)$ becomes

$$ \begin{aligned} p(\beta \given X, y) &\propto p(y\given X, \beta)\pi(\beta_*) \\ &\propto \text{exp}\left(-\frac{(y-\bs{1}\beta_0 - X_*\beta_*)\tr(y-\bs{1}\beta_0 - X_*\beta_*)}{2\sigma^2}\right) \times \text{exp}\left(-\frac{\beta\tr_*\beta_*}{2\tau^2} \right) \\ &\propto \text{exp}\left(-\frac{1}{2\sigma^2}\left\{(y-\bs{1}\beta_0 - X_*\beta_*)\tr(y-\bs{1}\beta_0 -X_*\beta_*) + \frac{\sigma^2}{\tau^2} \beta\tr_* \beta_*\right\}\right) \end{aligned} $$

Setting $\lambda = \sigma^2/\tau^2$, the negative log of the posterior density is, up to additive and multiplicative constants,

$$ \begin{aligned} f(\beta_0, \beta_*) &:= \frac{1}{2}(y-\bs{1}\beta_0 - X_*\beta_*)\tr(y-\bs{1}\beta_0 - X_*\beta_*) + \frac{\lambda}{2}\beta\tr_*\beta_* \\ &= \frac{1}{2}\sum_{i=1}^n (y_i - \beta_0 - x_i\tr \beta_*)^2 +\frac{\lambda}{2} ||\beta_*||^2. \end{aligned} \tag{7} $$

From (7) we define the ridge estimator as

$$ (\hat{\beta}_0, \hat{\beta}_{ridge}) = \argmin{\beta_0,\,\beta_*}{f(\beta_0, \beta_*)}. \tag{8} $$

Setting the derivative of (7) with respect to $\beta_0$ to 0 gives

$$ \begin{aligned} & \pfrac{f(\beta_0, \beta_*)}{\beta_0} = -\sum_{i=1}^n(y_i-\beta_0-x_i\tr \beta_*) \overset{set}{=} 0 \\ \implies & n\beta_0 = \sum_{i=1}^n y_i - \sum_{i=1}^n x_i\tr \beta_* \\ \implies & \beta_0 = \bar{y} - \bar{x}\tr \beta_*, \end{aligned} \tag{9} $$

where $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$ and $\bar{y} = \frac{1}{n} \sum_{i=1}^n y_i$. Substituting the result of (9) into (7), we can eliminate the intercept $\beta_0$ and obtain a new objective function

$$ \begin{aligned} g(\beta_*) &= f(\bar{y} - \bar{x}\tr \beta_*, \beta_*) \\ &= \frac{1}{2} \sum_{i=1}^n \left[(y_i-\bar{y}) - (x_i-\bar{x})\tr\beta_* \right]^2 + \frac{\lambda}{2}||\beta_*||^2. \end{aligned} \tag{10} $$
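
Here is a minimal numerical check of (9), a sketch using scikit-learn's Ridge (which appears again below) on synthetic data made up purely for illustration; sklearn's alpha plays the role of $\lambda$. Fitting with an intercept should give the same slopes as fitting the centered problem, with the intercept recovered as $\bar{y}-\bar{x}\tr\hat{\beta}_*$.

import numpy as np
from sklearn.linear_model import Ridge

np.random.seed(2)
n, p, lam = 200, 4, 2.0
X = np.random.randn(n, p) + 5.0      # nonzero column means on purpose
y = 1.5 + X @ np.random.randn(p) + np.random.randn(n)

# fit with an intercept, and fit the centered problem without one
full = Ridge(alpha=lam, fit_intercept=True).fit(X, y)
centered = Ridge(alpha=lam, fit_intercept=False).fit(X - X.mean(axis=0), y - y.mean())
b0 = y.mean() - X.mean(axis=0) @ centered.coef_

print(np.allclose(full.coef_, centered.coef_))   # same slope estimates
print(np.allclose(full.intercept_, b0))          # intercept recovered via (9)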

Equations (9) and (10) show that the intercept can be handled simply by centering: once all of $x_1,...,x_p$ and $y$ are centered to have mean 0, the penalized objective no longer involves $\beta_0$, which can be recovered afterwards as $\bar{y} - \bar{x}\tr \hat{\beta}_*$. Henceforth we assume this preprocessing is done beforehand, so that $X$ is a matrix with $p$ columns and $\beta = (\beta_1,...,\beta_p)'$ does not include the intercept term. The objective function for ridge regression becomes

$$ \text{PRSS}(\beta) = \frac{1}{2}(y-X\beta)\tr(y-X\beta) + \frac{\lambda}{2} \beta\tr \beta. \tag{11} $$

The objective function (11) can be expanded, and after discarding the terms that do not involve $\beta$, we obtain the objective function

$$ \begin{aligned} h(\beta) &= \frac{1}{2}\beta\tr X\tr X\beta - \beta\tr X\tr y + \frac{\lambda}{2} \beta \tr \beta \\ &= \frac{1}{2} \beta\tr (X\tr X + \lambda I)\beta - \beta\tr X\tr y. \end{aligned} $$

Setting the gradient of $h(\beta)$ to 0 yields

$$ \begin{aligned} & \nabla h(\beta) = (X\tr X + \lambda I)\beta - X\tr y \overset{set}{=} 0 \\ \implies & \hat{\beta}_{ridge} = (X\tr X + \lambda I)^{-1} X\tr y. \end{aligned}\tag{12} $$
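
As a quick sanity check (a sketch with synthetic, centered data), the closed form (12) should match scikit-learn's Ridge with fit_intercept=False, whose alpha parameter plays the role of $\lambda$:

import numpy as np
from sklearn.linear_model import Ridge

np.random.seed(3)
n, p, lam = 100, 6, 3.0
X = np.random.randn(n, p)
X -= X.mean(axis=0)
y = X @ np.random.randn(p) + np.random.randn(n)
y -= y.mean()

beta_closed = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)    # formula (12)
beta_sklearn = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_
print(np.allclose(beta_closed, beta_sklearn))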

Unlike the least squares estimator, the ridge estimator is biased. Let's prove this. Let $S = X\tr X$ (since the columns of $X$ are centered, this is $n$ times the sample covariance matrix). Then

$$ \begin{aligned} \hat{\beta}_{ridge} &= (S+\lambda I)^{-1} X\tr y \\ &= (S+\lambda I)^{-1} S S^{-1}X\tr y \\ &= \left[S(I+\lambda S^{-1})\right]^{-1} S \left[(X\tr X)^{-1}X\tr y\right] \\ &= (I+\lambda S^{-1})^{-1}S^{-1} S \hat{\beta}_{OLS} \\ &= \left(I+\lambda (X\tr X)^{-1}\right)^{-1}\hat{\beta}_{OLS}. \end{aligned} $$

Taking the conditional expectation, we get

$$ \E{\hat{\beta}_{ridge} \given X} = \left(I + \lambda (X\tr X)^{-1}\right)^{-1} \beta, $$

which does not equal $\beta$ when $\lambda$ is nonzero. However, the ridge estimator is asymptotically unbiased, since $(X\tr X)^{-1}$ converges to the zero matrix as $n$ grows (under mild conditions on the design).
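
A Monte Carlo sketch of this bias formula (the design, coefficients, and penalty below are arbitrary choices): averaging the ridge estimate over many noise draws should roughly reproduce $\left(I+\lambda(X\tr X)^{-1}\right)^{-1}\beta$.

import numpy as np

np.random.seed(4)
n, p, lam, sigma = 80, 4, 10.0, 1.0
X = np.random.randn(n, p)
beta = np.array([1.0, -2.0, 0.5, 3.0])

theory = np.linalg.solve(np.eye(p) + lam * np.linalg.inv(X.T @ X), beta)

n_rep = 20000
avg = np.zeros(p)
for _ in range(n_rep):
    y = X @ beta + sigma * np.random.randn(n)
    avg += np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y) / n_rep

print("Monte Carlo mean:", np.round(avg, 3))
print("theoretical mean:", np.round(theory, 3))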

The covariance matrix of the ridge estimator can be found in a similar fashion as (4):

$$ \begin{aligned} \Sigma_{ridge} &= \text{Var}\left(\hat{\beta}_{ridge} \given X\right) \\ &= \text{Var}\left((X\tr X + \lambda I)^{-1}X\tr y \given X\right) \\ &= \left[(X\tr X + \lambda I)^{-1} X\tr\right] \text{Var}(y \given X) \left[(X\tr X + \lambda I)^{-1} X\tr\right]\tr \\ &= \left[(X\tr X + \lambda I)^{-1} X\tr\right] \left(\sigma^2 I\right) \left[X(X\tr X + \lambda I)^{-1}\right] \\ &= \sigma^2\left(X\tr X + \lambda I\right)^{-1} X\tr X (X\tr X + \lambda I)^{-1} \\ &= \sigma^2 W (X\tr X)^{-1} W\tr, \end{aligned}\tag{13} $$

where $W = (X\tr X+\lambda I)^{-1} X\tr X$. (Note that $W^{-1} = I+\lambda(X\tr X)^{-1}$.)

We compare this covariance matrix to the covariance matrix for the OLS estimator by observing the difference:

$$ \begin{aligned} \Sigma_{OLS} - \Sigma_{ridge} &= \sigma^2\left[(X\tr X)^{-1} - W(X\tr X)^{-1} W\tr\right] \\ &= \sigma^2W\left\{W^{-1} (X\tr X)^{-1} {W^{-1}}\tr - (X\tr X)^{-1} \right\}W\tr \\ &= \sigma^2W\left\{(I+\lambda(X\tr X)^{-1}) (X\tr X)^{-1} (I+\lambda(X\tr X)^{-1})\tr - (X\tr X)^{-1} \right\}W\tr \\ &= \sigma^2 W\left\{2\lambda(X\tr X)^{-2} + \lambda^2 (X\tr X)^{-3} \right\} W\tr \\ &= \sigma^2 (X\tr X+\lambda I)^{-1} \left(2\lambda I + \lambda^2 (X\tr X)^{-1}\right) {\left(X\tr X+\lambda I\right)^{-1}}\tr \end{aligned} $$

Since the matrix in the middle is positive semi-definite and it is pre- and post-multiplied by a matrix and its transpose, the difference is also positive semi-definite. Hence the ridge estimator has smaller variance than the OLS estimator in the same sense as in (3) (keep in mind, though, that the ridge estimator is biased, so smaller variance alone does not settle the comparison of mean squared errors). Let's use the diabetes dataset as a simple illustration.

In [1]:
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge, RidgeCV
import matplotlib.pyplot as plt
In [2]:
diabetes = load_diabetes()
X = diabetes.data
y = diabetes.target
y = (y-y.mean())/y.std() # standardize
X = (X-X.mean(axis=0))/X.std(axis=0) # standardize

alphas = np.logspace(-1, 5, 200) # alpha as lambda
neg_log_alphas = -np.log10(alphas)

coefs = []
for a in alphas:
    ridge = Ridge(alpha=a, fit_intercept=False)
    ridge.fit(X, y)
    coefs.append(ridge.coef_)

# use cross validation to select alpha
reg = RidgeCV(cv=10, fit_intercept=False).fit(X, y)
alpha_cv = reg.alpha_
print("10-fold CV selects alpha =", alpha_cv)

plt.plot(neg_log_alphas, coefs)
plt.xlabel('-log(alpha)')
plt.title('Ridge Solution Paths')
plt.ylabel('coefficients')
plt.axvline(x=-np.log10(alpha_cv), linestyle='--')
plt.axis('tight')
plt.show()
10-fold CV selects alpha = 10.0

We see that as the penalty parameter $\lambda$ increases (i.e., as -log(alpha) decreases), the coefficients are shrunk continuously toward 0.

The singular value decomposition (SVD) of the design matrix $X$ can provide much more insight into the shrinkage process. The SVD of $X$ has the form

$$ X = UDV\tr, $$

where the columns of $U$ (an $n\times p$ matrix) are eigenvectors of $XX\tr$ that form an orthonormal basis for the column space of $X$, the columns of $V$ (a $p\times p$ matrix) are the eigenvectors of $X\tr X$ that form an orthonormal basis for the row space of $X$, and $D$ is a $p\times p$ diagonal matrix whose diagonal entries $d_1\geq d_2 \geq \cdots \geq d_p \geq 0$, known as the singular values of $X$, are the square roots of the eigenvalues of $X\tr X$ (equivalently, of the nonzero eigenvalues of $XX\tr$), arranged in decreasing order.

For $j=1,...,n$, let $x_j\tr$ denote the $j$th row of $X$. If we project the rows of $X$ onto a unit direction $v \in \mathbb{R}^p$ in the row space of $X$, so that each row $x_j\tr$ becomes $\left[(v v\tr) x_j\right]\tr = x_j\tr (vv\tr)$, the projected data matrix becomes

$$ Y = \begin{bmatrix} x_1\tr (vv\tr) \\ x_2\tr (vv\tr) \\ \vdots \\ x_n\tr (vv\tr) \end{bmatrix} = X v v\tr $$

The Gram (outer-product) matrix of the projected observations $Y$ becomes

$$ \begin{aligned} Y Y\tr &= X v v\tr (X v v\tr)\tr \\ &= X v v\tr v v\tr X\tr \\ &= X v v\tr X\tr \\ &= Xv (Xv)\tr \end{aligned} $$

We see that, as a result of the projection, this matrix collapses to the outer product of the single derived variable $z = Xv$ and has rank 1 (whereas the corresponding matrix $XX\tr$ for the original data has rank $p$). The total variance of the projected data is therefore carried by $z$ alone, and it equals $\frac{1}{n}z\tr z = \frac{1}{n}v\tr X\tr X v$. So which direction $v$ maximizes this variance? To answer the question, we need to solve

$$ v^* = \argmax{||v||=1}{ v\tr (X\tr X) v } \tag{14} $$

Theorem   The direction in the row space of $X$ that maximizes the variance of the projected data points is the first column $v_1$ of the matrix $V$ in the singular value decomposition $X= UDV\tr$.

Proof   Let $A=X\tr X$ denote the sample covariance matrix, and define the function $f(x)=x\tr A x$. Since $A$ is symmetric, there exists an orthogonal matrix $Q=[q_1,...,q_p]$ and a diagonal matrix $\Lambda=\text{diag}(\lambda_1,...,\lambda_p)$ such that

$$ A = Q\Lambda Q\tr. $$

Let $(x_i, \lambda_i)$ be an eigenvector-eigenvalue pair of $A$, with $||x_i||=1$. Then

$$ f(x_i) = x_i\tr A x_i = x_i\tr \lambda_i x_i = \lambda_i x_i\tr x_i = \lambda_i. $$

For any unit vector $x\in \mathbb{R}^p$, express it in terms of the columns of $Q$:

$$ x=c_1 q_1 + \cdots + c_p q_p, $$

such that $c_1^2 + \cdots + c_p^2 =1$ since $||x||=1$. Furthermore, let $\lambda_{max}$ and $\lambda_{min}$ denote the maximum and minimum of the eigenvalues, respectively. Then,

$$ \begin{aligned} f(x) &= (c_1q_1 + \cdots + c_p q_p)\tr A(c_1q_1 + \cdots + c_p q_p) \\ &= c_1^2 q_1\tr Aq_1 + c_2^2 q_2\tr Aq_2 + \cdots + c_p^2 q_p\tr Aq_p \\ &= c_1^2\lambda_1 + c_2^2\lambda_2 + \cdots + c_p^2 \lambda_p \\ &\leq \lambda_{max}(c_1^2 + c_2^2 + \cdots + c_p^2) \\ &=\lambda_{max}. \end{aligned} $$

Similarly we can show that $f(x)\geq \lambda_{min}$. Using the SVD of $X$, the sample covariance matrix can be written as

$$ X\tr X = (UDV\tr)\tr (UDV\tr) = VDU\tr UDV\tr = VD^2 V\tr, $$

which is precisely the eigen decomposition of $X\tr X$. The first column $v_1$ of $V$ is the eigenvector that corresponds to the maximum eigenvalue $d_1^2$ of $X\tr X$. $\blacksquare$
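
A quick numerical check of the theorem (a sketch with an arbitrary random design): the top eigenvector attains the maximum value of $f$, and random unit directions never exceed it.

import numpy as np

np.random.seed(5)
X = np.random.randn(50, 5)
A = X.T @ X

eigvals, eigvecs = np.linalg.eigh(A)          # eigenvalues in ascending order
v1, lam_max = eigvecs[:, -1], eigvals[-1]
print("f(v1) =", v1 @ A @ v1, "  lambda_max =", lam_max)

# random unit directions never exceed lambda_max
vs = np.random.randn(1000, 5)
vs /= np.linalg.norm(vs, axis=1, keepdims=True)
print("max over random unit v:", np.max(np.einsum('ij,jk,ik->i', vs, A, vs)))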

The columns of $V$ are so special that they are referred to as the principal component directions of $X$. We have already shown that the new derived vector $z_1 = Xv_1$ has the largest sample variance among all normalized linear combinations of the columns of $X$. This sample variance is

$$ \text{var}(z_1) = \frac{1}{n}z_1\tr z_1 = \frac{1}{n} v_1\tr X\tr X v_1 = \frac{1}{n}d_1^2. $$

Moreover, we can write

$$ z_1 = Xv_1 = UDV\tr v_1 = d_1u_1. $$

This derived new variable $z_1$ also has a name: it is the first principal component of $X$, and its normalized form is the first column $u_1$ of the matrix $U$. This is beautiful. Subsequent principal components $z_2,...,z_p$ have decreasing sample variance.
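
Here is a small check of these identities on the standardized diabetes data (a sketch; it simply recomputes the standardization done in the earlier cell):

import numpy as np
from sklearn.datasets import load_diabetes

X = load_diabetes().data
X = (X - X.mean(axis=0)) / X.std(axis=0)      # same standardization as before
n = X.shape[0]

U, d, Vt = np.linalg.svd(X, full_matrices=False)
z1 = X @ Vt[0]                                # first principal component
print(np.allclose(z1, d[0] * U[:, 0]))        # z1 = d1 * u1
print(z1 @ z1 / n, d[0] ** 2 / n)             # var(z1) = d1^2 / n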

Let's put this in the ridge regression setting. Using the SVD of $X$, we can rewrite the ridge solution as

$$ \begin{aligned} X\hat{\beta}_{ridge} &= X(X\tr X+\lambda I)^{-1}X\tr y \\ &= UD(D^2+\lambda I)^{-1}DU\tr y \\ &= \sum_{j=1}^p u_j \frac{d_j^2}{d_j^2+\lambda} u_j\tr y. \end{aligned} $$

If $\lambda=0$, the ridge fit is precisely $UU\tr y$, the projection of $y$ onto the column space of $X$ spanned by the orthonormal columns of $U$. For $\lambda >0$, ridge does something different: it shrinks the projection onto the $j$th principal component by the factor $d_j^2/(d_j^2+\lambda)$. Directions with smaller variance (smaller $d_j^2$) are shrunk more, so ridge shrinks most where the data provide the least information.
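
The shrinkage factors are easy to compute explicitly. The following sketch evaluates $d_j^2/(d_j^2+\lambda)$ on the standardized diabetes data at $\lambda = 10$ (the value selected by cross-validation above) and checks that the SVD formula reproduces the closed-form ridge fit:

import numpy as np
from sklearn.datasets import load_diabetes

data = load_diabetes()
X = (data.data - data.data.mean(axis=0)) / data.data.std(axis=0)
y = (data.target - data.target.mean()) / data.target.std()
lam = 10.0

U, d, Vt = np.linalg.svd(X, full_matrices=False)
shrink = d**2 / (d**2 + lam)
print("shrinkage factors:", np.round(shrink, 3))

fit_svd = U @ (shrink * (U.T @ y))                                             # SVD formula
fit_closed = X @ np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)  # closed form
print(np.allclose(fit_svd, fit_closed))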

Bayes Estimator and LASSO

In general let's assume that $y$ and $x_1,...,x_p$ are standardized, and for $q\geq0$ define a family of priors for $\beta_j$:

$$ \pi(\beta_j) \propto \text{exp}\left(-|\beta_j|^q/\tau\right). $$

The posterior density becomes

$$ \begin{aligned} p(\beta \given X, y) &\propto p(y\given X, \beta)\pi(\beta) \\ &\propto \text{exp}\left(-\frac{(y- X\beta)\tr(y-X\beta)}{2\sigma^2}\right) \times \text{exp}\left(-\frac{1}{\tau}\sum_{j=1}^p |\beta_j|^q \right) \\ &\propto \text{exp}\left(-\frac{1}{2\sigma^2}\left\{(y - X\beta)\tr(y-X\beta) + \frac{2\sigma^2}{\tau} \sum_{j=1}^p |\beta_j|^q\right\}\right) \end{aligned} $$

Setting $\lambda=2\sigma^2/\tau$, we arrive at a generalized Bayes estimator as the minimizer of the negative log posterior density:

$$ \hat{\beta} = \argmin{\beta}{\left\{\sum_{i=1}^n (y_i - x_i\tr\beta)^2 + \lambda \sum_{j=1}^p |\beta_j|^q\right\}.} $$

In particular, if $q=2$, we have the ridge estimator. If $q=0$, the penalty $\sum_{j=1}^p|\beta_j|^0$ simply counts the number of nonzero coefficients, which corresponds to best subset selection. If $q=1$, this becomes the LASSO estimator. Note that the Bayes estimator here is nothing but the posterior mode under a specific class of priors; general Bayesian modeling can be much more flexible.
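
To see how the prior shape changes with $q$, here is a small plotting sketch of the unnormalized densities $\text{exp}(-|\beta_j|^q)$ with $\tau=1$ (chosen only for the picture); $q=2$ gives the Gaussian (ridge) prior and $q=1$ the Laplace (LASSO) prior:

import numpy as np
import matplotlib.pyplot as plt

b = np.linspace(-3, 3, 400)
for q in [0.5, 1, 2, 4]:
    plt.plot(b, np.exp(-np.abs(b) ** q), label='q = %s' % q)
plt.xlabel('beta_j')
plt.ylabel('unnormalized prior density')
plt.legend()
plt.show()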

When $q=1$ we have the LASSO estimator:

$$ \hat{\beta}_{lasso} = \argmin{\beta}{\left\{\sum_{i=1}^n (y_i - x_i\tr\beta)^2 + \lambda \sum_{j=1}^p |\beta_j|\right\}.} \tag{15} $$

The LASSO differs from ridge in that coefficients can be shrunk to exactly 0. We can view (15) as the Lagrangian form of the following constrained optimization problem:

$$ \begin{aligned} &\argmin{\beta} \sum_{i=1}^n(y_i-x_i\tr\beta)^2 \\ & \text{s.t. } \sum_{j=1}^p |\beta_j| \leq t. \end{aligned} $$

The feasible region in this case is a "diamond", so when the contours of the least squares error function hit one of its "corners", a coefficient becomes exactly zero. Let's visualize the LASSO using the same diabetes dataset.

In [63]:
from sklearn.linear_model import lasso_path, LassoCV
from itertools import cycle
In [86]:
eps = 1e-3  # path length: ratio alpha_min / alpha_max
alphas_lasso, coefs_lasso, _ = lasso_path(X, y, eps=eps)  # X and y are already centered, so no intercept is needed

# obtain standard color cycle
prop_cycle = plt.rcParams['axes.prop_cycle']
colors = cycle(prop_cycle.by_key()['color'])

# take negative log for better axis label
neg_log_alphas_lasso = -np.log10(alphas_lasso)
for coef_l, c in zip(coefs_lasso, colors):
    l1 = plt.plot(neg_log_alphas_lasso, coef_l, c=c)

# use cross validation to select alpha
reg = LassoCV(cv=10, fit_intercept=False).fit(X, y)
alpha_cv = reg.alpha_
print("10-fold CV selects alpha =", alpha_cv)

plt.xlabel('-log(alpha)')
plt.ylabel('coefficients')
plt.title('LASSO Solution Paths')
plt.axis('tight')
plt.axvline(x=-np.log10(alpha_cv), linestyle='--')
plt.show()
10-fold CV selects alpha = 0.015576630374856624
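
As a short follow-up sketch (reusing the standardized X and y from the cells above; the penalty value 0.1 is picked by eye from the path plots), counting coefficients that are exactly zero makes the contrast with ridge concrete:

from sklearn.linear_model import Lasso, Ridge

lasso_fit = Lasso(alpha=0.1, fit_intercept=False).fit(X, y)
ridge_fit = Ridge(alpha=0.1, fit_intercept=False).fit(X, y)
print("coefficients that are exactly zero, LASSO:", int((lasso_fit.coef_ == 0).sum()))
print("coefficients that are exactly zero, ridge:", int((ridge_fit.coef_ == 0).sum()))

The LASSO fit typically zeroes out several coefficients at this penalty level, whereas the ridge coefficients are merely small but nonzero.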