Self-adjoint Mappings - Part 1

June 18, 2019 $\newcommand{\bs}{\boldsymbol}$ $\newcommand{\argmin}[1]{\underset{\bs{#1}}{\text{arg min}}\,}$ $\newcommand{\argmax}[1]{\underset{\bs{#1}}{\text{arg max}}\,}$ $\newcommand{\tr}{^{\top}}$ $\newcommand{\norm}[1]{\left|\left|\,#1\,\right|\right|}$ $\newcommand{\given}{\,|\,}$ $\newcommand{\st}{\,\big|\,}$ $\newcommand{\E}[1]{\mathbb{E}\left[#1\right]}$ $\newcommand{\P}[1]{\mathbb{P}\left(#1\right)}$ $\newcommand{\abs}[1]{\left|#1\right|}$ $\newcommand{\blue}[1]{{\color{blue} {#1}}}$ $\newcommand{\red}[1]{{\color{red} {#1}}}$ $\newcommand{\orange}[1]{{\color{orange} {#1}}}$ $\newcommand{\pfrac}[2]{\frac{\partial #1}{\partial #2}}$

The idea of adjointness is what makes linear algebra so beautiful. Given a linear map between inner product spaces $A: X\to U$, the adjoint map $A^*: U\to X$ is defined via the Riesz representation theorem to satisfy

$$ (x,A^* u) = (Ax,u), \tag{1} $$

where the first inner product is in $X$ and the second is in $U$. Previously, we showed that with respect to the standard Euclidean structure and an orthonormal basis, $A$ and $A^*$ can be interpreted as matrices that are transposes of each other.

In this post, we examine a special type of linear map. A linear map $A$ of a Euclidean space into itself is called self-adjoint if

$$ A^* = A. $$

In particular, when $A$ acts on a real Euclidean space and is represented with respect to an orthonormal basis, the corresponding matrix is symmetric.

Orthogonal Projection

Orthogonal projection is one of the most important concepts in linear algebra.

Theorem 1   Let $X$ be an inner product space and $Y$ a subspace. The orthogonal projection $P_Y:X\to Y$, defined by $P_Y x = z$ where $x=z+z^{\perp}$ with $z\in Y$ and $z^{\perp}\in Y^{\perp}$, is self-adjoint.

Proof   Let $Y$ be a subspace of $X$. For all $x\in X$, its unique orthogonal decomposition is $x=z+z^{\perp}$, where $z\in Y$ and $z^{\perp}\in Y^{\perp}$. Take any $y\in Y$,

$$ (P_Y x,y) = (z,y) = (z+z^{\perp},y) =(x,P_Yy), $$

where the middle equality follows from the fact that $(z^{\perp},y)=0$, and the last equality uses $P_Y y = y$ for $y\in Y$. This proves that $P_Y^*=P_Y$. $\blacksquare$

As a special case, let $A$ be an $m\times n$ real matrix where $m >n$ and $A$ has only the trivial nullspace. The orthogonal projection onto $\text{col}(A)$ is

$$ P_{A} = A(A^* A)^{-1}A^*. \tag{2} $$

From the above it is evident that $P_A$ is self-adjoint. (As a side note, $P_A^2=P_A$, so it is also idempotent.) Now for any vector $p\in \mathbb{R}^m$, let $Az=P_A p$ be the projection of $p$ onto $\text{col}(A)$. Then we have the following theorem.
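As a quick numerical illustration (a minimal sketch; the $4\times 3$ matrix below is arbitrary and almost surely has full column rank), we can build $P_A$ from (2) and check that it is both self-adjoint and idempotent:

set.seed(10)
A <- matrix(rnorm(12),4,3)             # arbitrary 4x3 matrix with full column rank
P_A <- A%*%solve(t(A)%*%A)%*%t(A)      # eqn (2)
all.equal(P_A, t(P_A))                 # self-adjoint: P_A = P_A^*
all.equal(P_A, P_A%*%P_A)              # idempotent: P_A^2 = P_A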

Theorem 2   Among all elements of $\text{col}(A)$, the one closest in Euclidean distance to $p$ is $Az=P_Ap$, where $P_A$ is given by (2). That is, the vector $x$ that minimizes the distance $||Ax-p||^2$ is $x=z$. Furthermore, $Az-p$ is orthogonal to $\text{col}(A)$.

Proof   Cancel $A$ from both sides of $Az=P_Ap$ (possible since $A$ has a trivial nullspace) to obtain

$$ z = (A^*A)^{-1}A^*p \implies A^*Az=A^*p. \tag{3} $$

First we show that (3) has a unique solution. Since $A^*p \in \mathbb{R}^n$, we have a system of $n$ equations in $n$ unknowns, so it suffices to show that $A^*A$ has a trivial nullspace. To that end, suppose $A^*Ay=0$. Taking an inner product with $y$ and using adjointness, we have

$$ 0=(A^*Ay, y) = (Ay, Ay) = ||Ay||^2. $$

Hence $||Ay||=0$, so $Ay=0$. Since $A$ has a trivial nullspace, we have $y=0$. This also justifies the existence of the inverse $(A^*A)^{-1}$ in (2).

Now pick $z\in \mathbb{R}^n$ such that $Az-p$ is orthogonal to $\text{col}(A)$. For any $x\in\mathbb{R}^n$, decompose

$$ Ax-p = Ax-Az+(Az-p). $$

Since $Ax-Az = A(x-z)\in \text{col}(A)$, which is orthogonal to $Az-p$ by assumption, we use the Pythagorean theorem to conclude that

$$ ||Ax-p||^2 = ||A(x-z)||^2 + ||Az-p||^2. $$

This shows that $x=z$ minimizes $||Ax-p||^2$. Finally, to find such a $z$, we use orthogonality and adjointness: for any $y\in\mathbb{R}^n$,

$$ (Az-p, Ay)=0 \implies (A^*(Az-p), y)=0. $$

This holds for all $y$, so $A^*(Az-p)=0$, which simplifies to (3). The proof is complete by existence and uniqueness. $\blacksquare$
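As a sanity check of Theorem 2 (a minimal sketch; the matrix $A$ and vector $p$ below are arbitrary), the solution of the normal equations (3) should reproduce the projection (2), and the residual should be orthogonal to $\text{col}(A)$:

set.seed(11)
A <- matrix(rnorm(12),4,3)
p <- rnorm(4)
z <- solve(t(A)%*%A, t(A)%*%p)         # normal equations (3)
P_A <- A%*%solve(t(A)%*%A)%*%t(A)      # projection (2)
all.equal(as.vector(A%*%z), as.vector(P_A%*%p))
round(t(A)%*%(A%*%z-p), 10)            # Az - p is orthogonal to col(A)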

The Hessian

Let $f(x)$ be a real-valued twice-differentiable function whose input $x=(x_1,...,x_n)$ has dimension $n$. The Taylor series approximation of second order has the form

$$ f(x+a)=f(x) + l(a) + \frac{1}{2}q(a) + ||a||^2 \epsilon(||a||), $$

where $\epsilon(||a||)\to 0$ as $a\to 0$, and $l(a)$ is a linear function of $a$, which by the Riesz representation theorem has the form

$$ l(a) = (a,y), \quad y\in \mathbb{R}^n. $$

From Taylor's theorem, we have that

$$ y = \nabla f(x), $$

that is, $y$ is precisely the gradient vector evaluated at $x$. The function $q(a)$ is a quadratic form:

$$ q(a) = (a, Ha), \quad H \in M_{n\times n}(\mathbb{R}), $$

and $H$ is self-adjoint. Again by Taylor's theorem, $H$ is the Hessian of $f$, whose $ij$th entry is given by

$$ h_{ij} = \frac{\partial^2 f}{\partial x_j \partial x_i}(x). $$

The self-adjointness of $H$ follows from the fact that the mixed partial derivatives of a twice continuously differentiable function are equal, i.e., $h_{ij} = h_{ji}$. In addition, the quadratic form $(a, Ha)$ can be written as a sum:

$$ \begin{aligned} q(a) &= \sum_{i,j}h_{ij}a_i a_j \\ &= \sum_{k=1}^n h_{kk} a_k^2 + 2\sum_{1\leq i<j \leq n} h_{ij}a_i a_j. \end{aligned} $$
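A quick numerical check of this expansion (a minimal sketch with an arbitrary symmetric $3\times 3$ matrix standing in for $H$):

set.seed(12)
B <- matrix(rnorm(9),3,3)
H <- (B+t(B))/2                        # an arbitrary symmetric "Hessian"
a <- rnorm(3)
lhs <- as.numeric(t(a)%*%H%*%a)        # (a, Ha)
rhs <- sum(diag(H)*a^2) + 2*(H[1,2]*a[1]*a[2] + H[1,3]*a[1]*a[3] + H[2,3]*a[2]*a[3])
all.equal(lhs, rhs)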

Can we express $q(a)$ simply as a sum of squares without the extra mixed terms? In matrix language: Can we diagonalize $H$?

Diagonalization Algorithm

We start with only two variables: let $a=(a_1,a_2)'$ and consider the two-variable quadratic form:

$$ \begin{aligned} q(a) &= h_{11}a_1^2 + h_{22}a_2^2 + 2h_{12}a_1a_2 \\ &= h_{11}\left(a_1 + h_{11}^{-1}h_{12}a_2 \right)^2 - h_{11}^{-1}\left(h_{12}a_2\right)^2 +h_{22}a_2^2\\ &= h_{11}\left(a_1 + h_{11}^{-1}h_{12}a_2 \right)^2 + (h_{22}-h_{11}^{-1}h_{12}^2)a_2^2 \\ &:= d_{1} b_{1}^2 + d_{2} b_{2}^2, \end{aligned} $$

where $h_{11}\neq 0$, $b_1 = a_1+h_{11}^{-1}h_{12}a_2$, and $b_2=a_2$.

In matrix form, $b$ is related to $a$ via a change of basis:

$$ b=\begin{bmatrix} b_{1} \\ b_{2} \end{bmatrix} = \begin{bmatrix} 1 & h_{11}^{-1}h_{12} \\ 0 & 1 \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \end{bmatrix} = La. $$

The matrix $L$ changes variables so that, in terms of the new variable $b$, $q$ is diagonal; that is,

$$ \begin{aligned} q(L^{-1} b) &= \sum d_i b_i^2 \\ &= (L^{-1}b, HL^{-1}b) \\ &= (b, L^{-1*}HL^{-1} b). \end{aligned} $$

The matrix $D:=L^{-1*}HL^{-1}$ is diagonal, with the $d_i$'s on its diagonal. In other words, we have

$$ \begin{aligned} H = L^* D L. \end{aligned} $$
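Here is a minimal sketch verifying this factorization for an arbitrary $2\times 2$ symmetric matrix with $h_{11}\neq 0$ (the numbers below are made up):

H <- matrix(c(2,3,3,5),2,2)              # arbitrary symmetric 2x2 with h11 != 0
L <- matrix(c(1,0,H[1,2]/H[1,1],1),2,2)  # the change of basis above
D <- diag(c(H[1,1], H[2,2]-H[1,2]^2/H[1,1]))
all.equal(t(L)%*%D%*%L, H)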

In general, if $a=(a_1,...,a_n)'$ and $h_{11}$ is nonzero, we can rewrite $q(a)$ as

$$ \begin{aligned} q(a) &= h_{11}\left(a_1 + h_{11}^{-1}\sum_{j\neq 1} h_{1j}a_j \right)^2 - h_{11}^{-1}\left(\sum_{j\neq 1} h_{1j}a_j\right)^2 + \sum_{i,j\neq 1} h_{ij} a_ia_j\\ &:= h_{11} b_1^2 + q_2(a), \end{aligned} $$

where

$$ b_1 = a_1 + h_{11}^{-1}\sum_{j\neq 1} h_{1j}a_j, \tag{4} $$

and $q_2(a)$ is a quadratic form that does not depend on $a_1$. If we let $G$ be the $(n-1)\times (n-1)$ matrix obtained by removing the first row and first column of $H$, and let $g$ denote the first row of $H$ (as a column vector) with its first component removed, then

$$ q_2(a) = (a, (G - h_{11}^{-1}g g\tr)a), $$

where on the right-hand side $a$ denotes the truncated vector $(a_2,\dots,a_n)'$.

Equation (4) generates a change of basis matrix $L$, which is constructed by replacing the first row of an $n\times n$ identity matrix with the coefficients of the linear combination:

$$ L = \begin{bmatrix} 1 & h_{11}^{-1}h_{12} & \cdots & h_{11}^{-1}h_{1n} \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \end{bmatrix}. \tag{5} $$
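To see the general step in action, here is a minimal sketch (with an arbitrary symmetric $3\times 3$ matrix whose $h_{11}\neq 0$) checking that the $L$ of (5) decouples the first variable: $L^{-1*}HL^{-1}$ has $h_{11}$ in the corner, zeros in the rest of the first row and column, and $G - h_{11}^{-1}g g\tr$ in the remaining block.

H <- matrix(c(2,1,-1, 1,3,0, -1,0,4),3,3)  # arbitrary symmetric 3x3 with h11 != 0
L <- diag(3); L[1,2:3] <- H[1,2:3]/H[1,1]  # eqn (5)
L_inv <- solve(L)
round(t(L_inv)%*%H%*%L_inv, 10)            # first row/column reduced to (h11, 0, 0)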

If $h_{11}$ is zero, suppose $h_{kk}$ is the first nonzero diagonal element. We can switch the variables $a_1$ and $a_k$ using a permutation matrix $P$. The matrix $H$ is changed by first switching the first and the $k$th rows, then switching the first and the $k$th columns. That is,

$$ H \leftarrow P^{-1*}HP^{-1}. $$

The resulting change of basis matrix becomes $LP$.

If all of the diagonal entries of $H$ are zero but there exist nonzero off-diagonal entries, say $h_{kl}=h_{lk}\neq 0$ with $k < l$, we replace $a_k$ and $a_l$ with $a_k+a_l$ and $a_l-a_k$. This produces nonzero diagonal terms, since

$$ \frac{1}{2}h_{kl}(a_k+a_l)^2 - \frac{1}{2}h_{kl}(a_l-a_k)^2 = 2h_{kl}a_ka_l. $$

This transformation can be represented by a matrix $C$, constructed by starting from the $n\times n$ identity matrix and setting

$$ C_{kl} \leftarrow 1 \, \text{ and } \, C_{lk} \leftarrow -1. \tag{6} $$

The matrix $H$ is then transformed by

$$ H \leftarrow C^{-1*}HC^{-1}. $$

The new change of basis matrix becomes $LC$. This concludes this beautiful algorithm! See below for its implementation in R.
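Before diving into the implementation, here is a quick check (a minimal sketch with an arbitrary symmetric matrix that has a zero diagonal) that the transform in (6) produces nonzero diagonal entries while representing the same quadratic form:

H <- matrix(c(0,1,2, 1,0,3, 2,3,0),3,3)    # arbitrary symmetric 3x3, zero diagonal
C <- diag(3); C[1,2] <- 1; C[2,1] <- -1    # eqn (6) with k=1, l=2
C_inv <- solve(C)
H_new <- t(C_inv)%*%H%*%C_inv
diag(H_new)                                # now contains nonzero entries
a <- c(1,2,3)
all.equal(as.numeric(t(a)%*%H%*%a), as.numeric(t(C%*%a)%*%H_new%*%(C%*%a)))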

In [33]:
# reduce H: drop the first row/column and form G - g g'/h11 (the matrix of q_2)
reduce <- function(H) {
    H[-1,-1]-H[1,-1]%*%t(H[1,-1])/H[1,1]
}
In [34]:
# transform H via a change of basis
transform <- function(H,P) {
    P_inv <- solve(P)
    t(P_inv)%*%H%*%P_inv
}
In [2]:
# permutation matrix P for switching 1st and kth row/col
get_P <- function(k,n) {
    P <- diag(n)
    P[c(1,k),] <- P[c(k,1),]
    return(P)
}
In [66]:
# obtain L in eqn (5), with the pivot row placed at row i
get_L <- function(H,i,n) {
    L <- diag(n)
    if(all(H==0)) {
        L[i,] <- rep(0,n) # degenerate case: remaining form is zero, so d_i = 0
        return(L)
    }
    L[i,(i+1):n] <- H[1,-1]/H[1,1]
    return(L)
}
In [67]:
# find the first nonzero diagonal index k (0 if the whole diagonal is zero)
get_k <- function(H) {
    h <- diag(H)
    if(any(h!=0)) {
        return(which(h!=0)[1])
    }
    return(0)
}
In [5]:
# in case all diagonal entries are 0, find nonzero h_kl
get_kl <- function(H) {
    m <- dim(H)[1] # current dim of H
    p_index <- min(which(H!=0)) # first nonzero entry in column-major order
    k <- (p_index-1)%%m + 1  # row index
    l <- (p_index-1)%/%m + 1 # column index
    return(c(k,l))
}
In [6]:
# obtain C in eqn (6)
get_C <- function(k,l,n) {
    C <- diag(n)
    C[k,l] <- 1
    C[l,k] <- -1
    return(C)
}
In [68]:
# main algorithm that diagonalizes any real symmetric matrix H
diagonalize <- function(H) {
    n <- dim(H)[1]
    L <- diag(n)
    D <- rep(0,n)
    for(i in seq_len(n-2)){ # loop over the first n-2 pivots (empty when n = 2)
        if(all(H==0)) {
            L <- get_L(H,i,n)%*%L
            D[i] <- 0
            H <- H[-1,-1]
            next
        }
        k <- get_k(H)
        if(k==0) {
            kl <- get_kl(H)
            k <- kl[1]
            l <- kl[2]
            C <- get_C(k,l,n)
            H <- transform(H,C)
            k <- get_k(H)
            if(k==1) {
                L <- get_L(H,i,n)%*%C%*%L
                D[i] <- H[1,1]
                H <- reduce(H)
            } else {
                P <- get_P(k,n)
                H <- transform(H,P)
                L <- get_L(H,i,n)%*%P%*%C%*%L # C applied first, then P, then the row-i reduction
                D[i] <- H[1,1]
                H <- reduce(H)
            }
        } else if(k==1) {
            L <- get_L(H,i,n)%*%L
            D[i] <- H[1,1]
            H <- reduce(H)
        } else {
            P <- get_P(k,n)
            H <- transform(H,P)
            L <- get_L(H,i,n)%*%P%*%L
            D[i] <- H[1,1]
            H <- reduce(H)
        }
    }
    if(all(H==0)) {
        L <- get_L(H[-1,-1],n,n)%*%get_L(H,n-1,n)%*%L
        D[n-1] <- 0
        D[n] <- 0
        return(list('D'=diag(D), 'L'=L))
    }
    k <- get_k(H)
    if(k==0) {
        # both remaining diagonal entries are zero: apply the trick of
        # eqn (6) to the last two variables before reducing
        H <- transform(H,get_C(1,2,2))
        Cf <- diag(n); Cf[n-1,n] <- 1; Cf[n,n-1] <- -1
        L <- Cf%*%L
        k <- get_k(H)
    }
    if(k==1) {
        L <- get_L(H,n-1,n)%*%L
        D[n-1] <- H[1,1]
        D[n] <- H[2,2]-H[1,2]^2/H[1,1]
    } else {
        # only H[2,2] is nonzero: swap the two remaining variables
        # (rows n-1 and n of the accumulated change of basis) before reducing
        H <- transform(H,get_P(2,2))
        P <- diag(n); P[c(n-1,n),] <- P[c(n,n-1),]
        L <- get_L(H,n-1,n)%*%P%*%L
        D[n-1] <- H[1,1]
        D[n] <- H[2,2]-H[1,2]^2/H[1,1]
    }
    return(list('D'=diag(D), 'L'=L))
}

That was a fun exercise! For testing, we generate a 3 by 3 matrix in four cases: 1. all diagonal entries are nonzero, 2. some diagonal entries are zero, 3. all diagonal entries are zero, and 4. the zero matrix.

In [82]:
set.seed(1)
A <- matrix(rnorm(9),3,3)
H1 <- H2 <- H3 <- t(A)%*%A

print('Case 1'); print(round(H1,3))
print('--------------------')
print('Case 2'); H2[1,1] <- 0; print(round(H2,3))
print('--------------------')
print('Case 3'); diag(H3) <- 0; print(round(H3,3))
print('--------------------')
print('Case 4'); H4 <- matrix(0,3,3); print(round(H4,3))
[1] "Case 1"
       [,1]   [,2]   [,3]
[1,]  1.124 -0.253 -0.651
[2,] -0.253  3.327  0.548
[3,] -0.651  0.548  1.114
[1] "--------------------"
[1] "Case 2"
       [,1]   [,2]   [,3]
[1,]  0.000 -0.253 -0.651
[2,] -0.253  3.327  0.548
[3,] -0.651  0.548  1.114
[1] "--------------------"
[1] "Case 3"
       [,1]   [,2]   [,3]
[1,]  0.000 -0.253 -0.651
[2,] -0.253  0.000  0.548
[3,] -0.651  0.548  0.000
[1] "--------------------"
[1] "Case 4"
     [,1] [,2] [,3]
[1,]    0    0    0
[2,]    0    0    0
[3,]    0    0    0
In [83]:
R1 <- diagonalize(H1); L1 <- R1$L; D1 <- R1$D
R2 <- diagonalize(H2); L2 <- R2$L; D2 <- R2$D
R3 <- diagonalize(H3); L3 <- R3$L; D3 <- R3$D
R4 <- diagonalize(H4); L4 <- R4$L; D4 <- R4$D

print('Case 1 Results:'); print(round(t(L1)%*%D1%*%L1,3))
print('--------------------')
print('Case 2 Results:'); print(round(t(L2)%*%D2%*%L2,3))
print('--------------------')
print('Case 3 Results:'); print(round(t(L3)%*%D3%*%L3,3))
print('--------------------')
print('Case 4 Results:'); print(round(t(L4)%*%D4%*%L4,3))
[1] "Case 1 Results:"
       [,1]   [,2]   [,3]
[1,]  1.124 -0.253 -0.651
[2,] -0.253  3.327  0.548
[3,] -0.651  0.548  1.114
[1] "--------------------"
[1] "Case 2 Results:"
       [,1]   [,2]   [,3]
[1,]  0.000 -0.253 -0.651
[2,] -0.253  3.327  0.548
[3,] -0.651  0.548  1.114
[1] "--------------------"
[1] "Case 3 Results:"
       [,1]   [,2]   [,3]
[1,]  0.000 -0.253 -0.651
[2,] -0.253  0.000  0.548
[3,] -0.651  0.548  0.000
[1] "--------------------"
[1] "Case 4 Results:"
     [,1] [,2] [,3]
[1,]    0    0    0
[2,]    0    0    0
[3,]    0    0    0
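As a further sanity check (a sketch relying on Sylvester's law of inertia: since $L$ is invertible in cases 1 through 3, the numbers of positive, negative, and zero $d_i$'s must match those of the eigenvalues of $H$), we can compare the signs of the recovered diagonal with the signs of the eigenvalues:

# signs of the d_i's vs. signs of the eigenvalues (should agree for cases 1-3;
# both are identically zero in case 4)
print(rbind(sort(sign(diag(D1))), sort(sign(eigen(H1)$values))))
print(rbind(sort(sign(diag(D2))), sort(sign(eigen(H2)$values))))
print(rbind(sort(sign(diag(D3))), sort(sign(eigen(H3)$values))))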