June 18, 2019 $\newcommand{\bs}{\boldsymbol}$ $\newcommand{\argmin}[1]{\underset{\bs{#1}}{\text{arg min}}\,}$ $\newcommand{\argmax}[1]{\underset{\bs{#1}}{\text{arg max}}\,}$ $\newcommand{\tr}{^{\top}}$ $\newcommand{\norm}[1]{\left|\left|\,#1\,\right|\right|}$ $\newcommand{\given}{\,|\,}$ $\newcommand{\st}{\,\big|\,}$ $\newcommand{\E}[1]{\mathbb{E}\left[#1\right]}$ $\newcommand{\P}[1]{\mathbb{P}\left(#1\right)}$ $\newcommand{\abs}[1]{\left|#1\right|}$ $\newcommand{\blue}[1]{{\color{blue} {#1}}}$ $\newcommand{\red}[1]{{\color{red} {#1}}}$ $\newcommand{\orange}[1]{{\color{orange} {#1}}}$ $\newcommand{\pfrac}[2]{\frac{\partial #1}{\partial #2}}$

The idea of adjointness is what makes linear algebra so beautiful. Given a linear map between inner product spaces $A: X\to U$, the adjoint map $A^*: U\to X$ is defined via the Riesz representation theorem to satisfy

$$ (x,A^* u) = (Ax,u), \tag{1} $$

where the first inner product is in $X$ and the second inner product is in $U$. Previously, we have shown that with respect to the standard Euclidean structure and an orthonormal basis, we can interpret $A$ and $A^*$ as matrices that are transposes of each other.

In this post, we examine a special type of linear map. A linear map $A$ of a Euclidean space into itself is called **self-adjoint** if

$$ A^* = A. $$

In particular, when $A$ acts on a real Euclidean space and is written with respect to an orthonormal basis, a self-adjoint $A$ is called **symmetric**.

Orthogonal projection is one of the most important concepts in linear algebra.

**Theorem 1** Let $X$ be an inner product space, and let $Y$ be a subspace. The **orthogonal projection** $P_Y:X\to Y$ defined by $P_Y x = z$, where $x\in X$, $z\in Y$, and $x=z+z^{\perp}$ with $z^{\perp}\in Y^{\perp}$, is self-adjoint.

*Proof* Let $Y$ be a subspace of $X$. For all $x\in X$, its unique orthogonal decomposition is $x=z+z^{\perp}$, where $z\in Y$ and $z^{\perp}\in Y^{\perp}$. Take any $y\in Y$,

$$ (P_Y x,y) = (z,y) = (z+z^{\perp},y) =(x,P_Yy), $$

where the middle equality follows from the fact that $(z^{\perp},y)=0$. This proves that $P_Y^*=P_Y$. $\blacksquare$

As a special case, let $A$ be an $m\times n$ real matrix, where $m > n$ and $A$ has only the trivial nullspace. The orthogonal projection onto $\text{col}(A)$ is

$$ P_{A} = A(A^* A)^{-1}A^*. \tag{2} $$

From the above it is evident that $P_A$ is self-adjoint. (As a side note, $P_A^2=P_A$, so it is also *idempotent*.) Now for any vector $p\in \mathbb{R}^m$, let $Az=P_A p$ be the projection of $p$ onto $\text{col}(A)$. Then we have the following theorem.

**Theorem 2** Among all elements of $\text{col}(A)$, the one closest in Euclidean distance to $p$ is $Az=P_Ap$, where $P_A$ is given by (2). That is, the vector $x$ that minimizes the distance $||Ax-p||^2$ is $z$. Furthermore, $Az-p$ is orthogonal to $\text{col}(A)$.

*Proof* Cancel $A$ from both sides of $Az=P_Ap$ (possible since $A$ has the trivial nullspace) and obtain

$$ z = (A^*A)^{-1}A^*p \implies A^*Az=A^*p. \tag{3} $$

First we show that (3) has a unique solution. Since $A^*p \in \mathbb{R}^n$, we have a system of $n$ equations in $n$ unknowns. It suffices to show that $A^*A$ has trivial nullspace. To prove that, suppose $A^*Ay=0$. Taking the inner product with $y$ and using adjointness, we have

$$ 0=(A^*Ay, y) = (Ay, Ay) = ||Ay||^2. $$

Hence $Ay=0$, and since $A$ has trivial nullspace, $y=0$. This also justifies the existence of the inverse $(A^*A)^{-1}$ in (2).

Now pick $z\in \mathbb{R}^n$ such that $Az-p$ is orthogonal to $\text{col}(A)$. Adding and subtracting $Az$, for any $x\in\mathbb{R}^n$ we have

$$ Ax-p = Ax-Az+(Az-p). $$

Since $Ax-Az = A(x-z)\in \text{col}(A)$ is orthogonal to $Az-p$ by assumption, the Pythagorean theorem gives

$$ ||Ax-p||^2 = ||A(x-z)||^2 + ||Az-p||^2. $$

This shows that $z$ minimizes the distance $||Ax-p||^2$. Finally, to find $z$, we use orthogonality and adjointness to conclude that

$$ (Az-p, Ay)=0 \implies (A^*(Az-p), y)=0. $$

This holds for all $y$, so $A^*(Az-p)=0$, which simplifies to (3). The proof is complete by existence and uniqueness. $\blacksquare$
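Theorem 2 is easy to check numerically. Below is a minimal sketch in `R`; the matrix `A` and vector `p` are arbitrary test data, not anything from the derivation above:

```
set.seed(42)
A <- matrix(rnorm(15), 5, 3)   # 5 x 3, full column rank almost surely
p <- rnorm(5)

# projection matrix from eqn (2)
P_A <- A %*% solve(t(A) %*% A) %*% t(A)

# self-adjoint and idempotent
stopifnot(all.equal(P_A, t(P_A)))
stopifnot(all.equal(P_A, P_A %*% P_A))

# z from the normal equations (3)
z <- solve(t(A) %*% A, t(A) %*% p)

# the residual A z - p is orthogonal to col(A)
stopifnot(all.equal(as.vector(t(A) %*% (A %*% z - p)), rep(0, 3)))

# agrees with R's built-in least-squares solver
stopifnot(all.equal(as.vector(z), as.vector(qr.solve(A, p))))
```

Note that `qr.solve` returns the least-squares solution for a rectangular full-rank system, so it should coincide with $z$ from (3).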

Let $f(x)$ be a real-valued twice-differentiable function whose input $x=(x_1,...,x_n)$ has dimension $n$. The second-order Taylor approximation has the form

$$ f(x+a)=f(x) + l(a) + \frac{1}{2}q(a) + ||a||^2 \epsilon(||a||), $$

where $\epsilon(||a||)\to 0$ as $a\to 0$, and $l(a)$ is a linear function of $a$, which by the Riesz representation theorem has the form

$$ l(a) = (a,y), \quad y\in \mathbb{R}^n. $$

From Taylor's theorem, we have that

$$ y = \nabla f(x), $$

is precisely the gradient vector evaluated at $x$. The function $q(a)$ is a **quadratic form**:

$$ q(a) = (a, Ha), \quad H \in M_{n\times n}(\mathbb{R}), $$

and $H$ is self-adjoint. Again from Taylor's theorem, $H$ is known as the **Hessian** of $f$, whose $ij$th entry is given by

$$ h_{ij} = \frac{\partial^2 f}{\partial x_j \partial x_i}(x). $$

The self-adjointness of $H$ follows from the fact that the mixed partials of twice continuously differentiable functions are equal: i.e., $h_{ij} = h_{ji}$. In addition, the quadratic form $(a, Ha)$ can be written as a sum:

$$ \begin{aligned} q(a) &= \sum_{i,j}h_{ij}a_i a_j \\ &= \sum_{k=1}^n h_{kk} a_k^2 + 2\sum_{1\leq i<j \leq n} h_{ij}a_i a_j. \end{aligned} $$
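As a concrete sketch, we can approximate a Hessian by central finite differences and verify both the symmetry claim and the equivalence of the matrix and double-sum forms of $q(a)$. The function `f`, the helper `num_hessian`, and the point `x0` below are all illustrative choices, not part of the derivation:

```
# an arbitrary smooth function of two variables
f <- function(x) x[1]^2 * x[2] + exp(x[1]) + x[2]^3

# central finite-difference approximation of the Hessian at x0
num_hessian <- function(f, x0, h = 1e-4) {
  n <- length(x0)
  H <- matrix(0, n, n)
  for (i in 1:n) for (j in 1:n) {
    ei <- ej <- rep(0, n); ei[i] <- h; ej[j] <- h
    H[i, j] <- (f(x0 + ei + ej) - f(x0 + ei - ej) -
                f(x0 - ei + ej) + f(x0 - ei - ej)) / (4 * h^2)
  }
  H
}

x0 <- c(0.5, -1)
H <- num_hessian(f, x0)

# mixed partials agree, so H is (numerically) symmetric
stopifnot(all.equal(H, t(H), tolerance = 1e-4))

# q(a) = (a, Ha): the matrix form equals the double sum over h_ij a_i a_j
a <- c(0.3, 0.7)
q_matrix <- as.numeric(t(a) %*% H %*% a)
q_sum <- sum(outer(a, a) * H)
stopifnot(all.equal(q_matrix, q_sum))
```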

Can we express $q(a)$ simply as a sum of squares without the extra mixed terms? In matrix language: can we **diagonalize** $H$?

We start with only two variables: let $a=(a_1,a_2)'$ and consider the two-variable quadratic form

$$ \begin{aligned} q(a) &= h_{11}a_1^2 + h_{22}a_2^2 + 2h_{12}a_1a_2 \\ &= h_{11}\left(a_1 + h_{11}^{-1}h_{12}a_2 \right)^2 - h_{11}^{-1}\left(h_{12}a_2\right)^2 +h_{22}a_2^2\\ &= h_{11}\left(a_1 + h_{11}^{-1}h_{12}a_2 \right)^2 + (h_{22}-h_{11}^{-1}h_{12}^2)a_2^2 \\ &:= d_{1} b_{1}^2 + d_{2} b_{2}^2, \end{aligned} $$

where we assume $h_{11}\neq 0$, and set $b_1 = a_1+h_{11}^{-1}h_{12}a_2$ and $b_2=a_2$.

In matrix form, $b$ is related to $a$ via a change of basis:

$$ b=\begin{bmatrix} b_{1} \\ b_{2} \end{bmatrix} = \begin{bmatrix} 1 & h_{11}^{-1}h_{12} \\ 0 & 1 \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \end{bmatrix} = La. $$

The matrix $L$ diagonalizes $q$: in terms of the new variable $b$,

$$ \begin{aligned} q(L^{-1} b) &= \sum d_i b_i^2 \\ &= (L^{-1}b, HL^{-1}b) \\ &= (b, L^{-1*}HL^{-1} b). \end{aligned} $$

The matrix $D:=L^{-1*}HL^{-1}$ is diagonal, with entries the $d_i$'s. In other words, we have

$$ \begin{aligned} H = L^* D L. \end{aligned} $$
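The two-variable factorization is easy to verify numerically. A minimal sketch, using an arbitrary symmetric $2\times 2$ matrix with $h_{11}\neq 0$:

```
# arbitrary 2 x 2 symmetric H with h11 != 0
H <- matrix(c(4, 2,
              2, 3), 2, 2)
h11 <- H[1, 1]; h12 <- H[1, 2]; h22 <- H[2, 2]

# d1 and d2 from completing the square
d1 <- h11
d2 <- h22 - h12^2 / h11
D <- diag(c(d1, d2))

# change of basis matrix L from the text
L <- matrix(c(1, h12 / h11,
              0, 1), 2, 2, byrow = TRUE)

# H = L^* D L
stopifnot(all.equal(H, t(L) %*% D %*% L))
```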

In general, if $a=(a_1,...,a_n)'$ and $h_{11}$ is nonzero, we can rewrite $q(a)$ as

$$ \begin{aligned} q(a) &= h_{11}\left(a_1 + h_{11}^{-1}\sum_{j\neq 1} h_{1j}a_j \right)^2 - h_{11}^{-1}\left(\sum_{j\neq 1} h_{1j}a_j\right)^2 + \sum_{i,j\neq 1} h_{ij} a_ia_j\\ &:= h_{11} b_1^2 + q_2(a), \end{aligned} $$

where

$$ b_1 = a_1 + h_{11}^{-1}\sum_{j\neq 1} h_{1j}a_j, \tag{4} $$

and $q_2(a)$ is a quadratic form that does not depend on $a_1$. If we let $G$ be the $(n-1)\times (n-1)$ matrix obtained by removing the first row and first column of $H$, and let $g$ denote the first row of $H$ (as a column vector) with its first component removed, then, writing $a$ for the truncated vector $(a_2,\dots,a_n)'$,

$$ q_2(a) = (a, (G - h_{11}^{-1}g g\tr)a). $$
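One reduction step can be checked numerically: $q(a)$ should split into $h_{11}b_1^2$ plus the quadratic form of the reduced matrix $G - h_{11}^{-1}gg\tr$ applied to $(a_2,\dots,a_n)'$. A sketch with random symmetric test data:

```
set.seed(7)
M <- matrix(rnorm(16), 4, 4)
H <- t(M) %*% M               # random symmetric H; h11 > 0 almost surely

g <- H[1, -1]                 # first row of H without the first entry
G <- H[-1, -1]                # H without first row and column
H2 <- G - outer(g, g) / H[1, 1]   # reduced (n-1) x (n-1) matrix

# q(a) = h11 * b1^2 + q2 on a random test vector a
a <- rnorm(4)
b1 <- a[1] + sum(g * a[-1]) / H[1, 1]   # eqn (4)
q <- as.numeric(t(a) %*% H %*% a)
q_split <- H[1, 1] * b1^2 + as.numeric(t(a[-1]) %*% H2 %*% a[-1])
stopifnot(all.equal(q, q_split))
```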

Equation (4) generates a change of basis matrix $L$, which is constructed by replacing the first row of an $n\times n$ identity matrix with the coefficients of the linear combination:

$$ L = \begin{bmatrix} 1 & h_{11}^{-1}h_{12} & \cdots & h_{11}^{-1}h_{1n} \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix}. \tag{5} $$

If $h_{11}$ is zero, suppose $h_{kk}$ is the first nonzero diagonal entry. We can swap the variables $a_1$ and $a_k$ using a permutation matrix $P$. The matrix $H$ is transformed by first swapping the first and $k$th rows, then the first and $k$th columns. That is,

$$ H \leftarrow P^{-1*}HP^{-1}. $$

The resulting change of basis matrix becomes $LP$.

If all of the diagonal entries of $H$ are zero but some off-diagonal entry is nonzero, say $h_{kl}=h_{lk}\neq 0$ with $k < l$, we replace $a_k$ and $a_l$ with $a_k+a_l$ and $a_l-a_k$. This produces a nonzero diagonal term, since

$$ \frac{1}{2}h_{kl}(a_k+a_l)^2 - \frac{1}{2}h_{kl}(a_l-a_k)^2 = 2h_{kl}a_ka_l. $$

This transformation can be represented by a matrix $C$, constructed by starting from an $n\times n$ identity matrix and setting

$$ C_{kl} \leftarrow 1 \, \text{ and } \, C_{lk} \leftarrow -1. \tag{6} $$

The matrix $H$ is then transformed by

$$ H \leftarrow C^{-1*}HC^{-1}. $$

The new change of basis matrix becomes $LC$. This concludes this beautiful algorithm! See below for its implementation in `R`.
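Before the full implementation, here is a quick standalone check of the off-diagonal trick on a hand-built $2\times 2$ example with a zero diagonal (the numbers are arbitrary):

```
# zero-diagonal symmetric H with h12 = h21 = 3
H <- matrix(c(0, 3,
              3, 0), 2, 2)

# C from eqn (6) with k = 1, l = 2
C <- diag(2)
C[1, 2] <- 1
C[2, 1] <- -1

# H <- C^{-1*} H C^{-1}
C_inv <- solve(C)
H_new <- t(C_inv) %*% H %*% C_inv

# the transform produces nonzero diagonal entries
stopifnot(all(diag(H_new) != 0))
```

In this tiny case the transform already diagonalizes $H$: the new diagonal is $(h_{12}/2,\,-h_{12}/2)$.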


```
# reduce H to the Schur complement G - g g^T / h11
reduce <- function(H) {
  H[-1, -1] - H[1, -1] %*% t(H[1, -1]) / H[1, 1]
}
```


```
# transform H via a change of basis: H <- P^{-1*} H P^{-1}
transform <- function(H, P) {
  P_inv <- solve(P)
  t(P_inv) %*% H %*% P_inv
}
```


```
# permutation matrix P for switching 1st and kth row/col
get_P <- function(k, n) {
  P <- diag(n)
  P[c(1, k), ] <- P[c(k, 1), ]
  return(P)
}
```


```
# obtain L in eqn (5)
get_L <- function(H, i, n) {
  L <- diag(n)
  if (all(H == 0)) {
    L[i, ] <- rep(0, n)  # degenerate case: return the zero combination
    return(L)
  }
  L[i, (i + 1):n] <- H[1, -1] / H[1, 1]
  return(L)
}
```


```
# find the first nonzero diagonal index k (0 if the diagonal is all zero)
get_k <- function(H) {
  h <- diag(H)
  if (any(h != 0)) {
    return(which(h != 0)[1])
  }
  return(0)
}
```


```
# in case all diagonal entries are 0, find a nonzero h_kl
get_kl <- function(H) {
  m <- dim(H)[1]                 # current dim of H
  p_index <- min(which(H != 0))  # first nonzero entry, column-major order
  k <- (p_index - 1) %% m + 1    # row index (plain %% m returns 0 for the last row)
  l <- (p_index - 1) %/% m + 1   # column index
  return(c(k, l))
}
```


```
# obtain C in eqn (6)
get_C <- function(k, l, n) {
  C <- diag(n)
  C[k, l] <- 1
  C[l, k] <- -1
  return(C)
}
```


```
# main algorithm that diagonalizes any real symmetric matrix H
diagonalize <- function(H) {
  n <- dim(H)[1]
  L <- diag(n)
  D <- rep(0, n)
  for (i in seq_len(n - 2)) {  # seq_len avoids a spurious pass when n = 2
    if (all(H == 0)) {
      L <- get_L(H, i, n) %*% L
      D[i] <- 0
      H <- H[-1, -1]
      next
    }
    k <- get_k(H)
    if (k == 0) {
      # the diagonal is all zero: use C to create a nonzero pivot
      kl <- get_kl(H)
      C <- get_C(kl[1], kl[2], n)
      H <- transform(H, C)
      k <- get_k(H)
      if (k == 1) {
        L <- get_L(H, i, n) %*% C %*% L
      } else {
        P <- get_P(k, n)
        H <- transform(H, P)
        # C acts before P, so it sits to the right of P in the product
        L <- get_L(H, i, n) %*% P %*% C %*% L
      }
      D[i] <- H[1, 1]
      H <- reduce(H)
    } else if (k == 1) {
      L <- get_L(H, i, n) %*% L
      D[i] <- H[1, 1]
      H <- reduce(H)
    } else {
      P <- get_P(k, n)
      H <- transform(H, P)
      L <- get_L(H, i, n) %*% P %*% L
      D[i] <- H[1, 1]
      H <- reduce(H)
    }
  }
  # handle the remaining 2 x 2 block
  if (all(H == 0)) {
    L <- get_L(H[-1, -1], n, n) %*% get_L(H, n - 1, n) %*% L
    D[n - 1] <- 0
    D[n] <- 0
    return(list('D' = diag(D), 'L' = L))
  }
  k <- get_k(H)
  if (k != 1) {
    P <- get_P(k, n)
    H <- transform(H, P)
    L <- get_L(H, n - 1, n) %*% P %*% L
  } else {
    L <- get_L(H, n - 1, n) %*% L
  }
  D[n - 1] <- H[1, 1]
  D[n] <- H[2, 2] - H[1, 2]^2 / H[1, 1]
  return(list('D' = diag(D), 'L' = L))
}
```

That was a fun exercise! For testing, we generate a 3-by-3 matrix in each of four cases: 1. all diagonal entries are nonzero, 2. some diagonal entries are zero, 3. all diagonal entries are zero, and 4. the zero matrix.


```
set.seed(1)
A <- matrix(rnorm(9),3,3)
H1 <- H2 <- H3 <- t(A)%*%A
print('Case 1'); print(round(H1,3))
print('--------------------')
print('Case 2'); H2[1,1] <- 0; print(round(H2,3))
print('--------------------')
print('Case 3'); diag(H3) <- 0; print(round(H3,3))
print('--------------------')
print('Case 4'); H4 <- matrix(0,3,3); print(round(H4,3))
```


```
R1 <- diagonalize(H1); L1 <- R1$L; D1 <- R1$D
R2 <- diagonalize(H2); L2 <- R2$L; D2 <- R2$D
R3 <- diagonalize(H3); L3 <- R3$L; D3 <- R3$D
R4 <- diagonalize(H4); L4 <- R4$L; D4 <- R4$D
print('Case 1 Results:'); print(round(t(L1)%*%D1%*%L1,3))
print('--------------------')
print('Case 2 Results:'); print(round(t(L2)%*%D2%*%L2,3))
print('--------------------')
print('Case 3 Results:'); print(round(t(L3)%*%D3%*%L3,3))
print('--------------------')
print('Case 4 Results:'); print(round(t(L4)%*%D4%*%L4,3))
```
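A closing remark: $H = L^* D L$ is a *congruence*, not a similarity, so the entries of $D$ are generally not the eigenvalues of $H$. By Sylvester's law of inertia, however, the counts of positive, negative, and zero entries must match the eigenvalue signs. A self-contained sketch on a hand-built indefinite matrix:

```
# Congruence preserves inertia (Sylvester's law of inertia): the signs
# of the completing-the-square pivots match the eigenvalue signs of H,
# even though the values themselves differ.
H <- matrix(c(2, 1, 0,
              1, -3, 1,
              0, 1, 1), 3, 3, byrow = TRUE)

# pivots via successive Schur complements (no zero pivot arises here)
d1 <- H[1, 1]
S1 <- H[-1, -1] - outer(H[1, -1], H[1, -1]) / H[1, 1]
d2 <- S1[1, 1]
d3 <- S1[2, 2] - S1[1, 2]^2 / S1[1, 1]

pivots <- c(d1, d2, d3)
eigs <- eigen(H, symmetric = TRUE)$values
stopifnot(all(sort(sign(pivots)) == sort(sign(eigs))))
```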