Singular Value Decomposition

Singular Value Decomposition (SVD) is widely used in applications such as data reduction, recommendation algorithms, and facial recognition; it can be viewed as a data-driven generalization of the Fourier transform (FFT), and other methods such as PCA can be derived from it.

Derivation

We are given a data matrix consisting of column vectors \({x_k} \in \mathbb{R}^n\) for \(k = 1, 2, \ldots, m\) as below, where each \(x_k\) might be a vector summarizing the facial characteristics of person \(k\), or a vector describing what a snapshot of a system looks like at time \(k\). \[ X = \begin{bmatrix} | & | & & | \\ x_1 & x_2 & \cdots & x_m \\ | & | & & | \\ \end{bmatrix} \] What SVD does is decompose the above data matrix into a product of three other matrices \[ X = U \Sigma V^T \] where \(U\) (the left singular vectors) and \(V\) (the right singular vectors) are unitary matrices, and \(\Sigma\) is a diagonal matrix. Specifically, we have \[ X = \begin{bmatrix} | & | & & | \\ u_1 & u_2 & \cdots & u_n \\ | & | & & | \\ \end{bmatrix} \begin{bmatrix} \sigma_1 & & \\ & \ddots & \\ & & \sigma_m \\ & \mathbf{0} & \\ \end{bmatrix} \begin{bmatrix} - & v_1^T & - \\ & \vdots & \\ - & v_m^T & - \\ \end{bmatrix} \] where unitary means \(U U^T = U^T U = I_{n \times n}\) and \(V V^T = V^T V = I_{m \times m}\).

The elements \(\sigma_k\) of \(\Sigma\) are the so-called singular values, and they are ordered by importance: \(\sigma_1 \geq \sigma_2 \geq \ldots \geq \sigma_m \geq 0\). The same ordering carries over to \(U\) and \(V\): \(u_1\) is more important than \(u_2\) in representing the columns of \(X\), because its singular value \(\sigma_1\) is at least as large as \(\sigma_2\).

Intuitively, if each \(x_k\) represents one person's face, we can interpret the columns of \(U\) as "eigenfaces" from which the original faces are rebuilt, while the columns of \(V\) (the rows of \(V^T\)), scaled by the singular values \(\sigma_k\), provide the mixing coefficients that combine \(u_1, u_2, \ldots\) to reconstruct \(x_1, x_2, \ldots, x_m\).

And the SVD is guaranteed to exist for any matrix; \(\Sigma\) is unique, while \(U\) and \(V\) are unique up to sign when the singular values are distinct.
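
As a quick sanity check, here is a minimal NumPy sketch (the data matrix is random and purely illustrative, and the sizes \(n = 6\), \(m = 4\) are arbitrary) that computes the full SVD and verifies the reconstruction and unitarity properties described above.

```python
import numpy as np

# Synthetic data matrix: n measurements per snapshot, m snapshots (columns).
n, m = 6, 4
X = np.random.randn(n, m)

# Full SVD: U is n x n, s holds the singular values, Vt is V^T (m x m).
U, s, Vt = np.linalg.svd(X, full_matrices=True)

# Rebuild the n x m Sigma with the singular values on its diagonal.
Sigma = np.zeros((n, m))
Sigma[:m, :m] = np.diag(s)

print(np.allclose(X, U @ Sigma @ Vt))     # X = U Sigma V^T
print(np.allclose(U.T @ U, np.eye(n)))    # U is unitary
print(np.allclose(Vt @ Vt.T, np.eye(m)))  # V is unitary
print(s)                                  # sigma_1 >= sigma_2 >= ... >= sigma_m
```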

Matrix Approximation

Let's introduce another form of SVD, assuming \(n \gg m\), i.e. when there are many more measurements/observations than snapshots: \[ \begin{align} X &= \begin{bmatrix} | & | & & | \\ u_1 & u_2 & \cdots & u_n \\ | & | & & | \\ \end{bmatrix} \begin{bmatrix} \sigma_1 & & \\ & \ddots & \\ & & \sigma_m \\ & \mathbf{0} & \\ \end{bmatrix} \begin{bmatrix} - & v_1^T & - \\ & \vdots & \\ - & v_m^T & - \\ \end{bmatrix} \\ &= \sigma_1 u_1 v_1^T + \sigma_2 u_2 v_2^T + \cdots + \sigma_m u_m v_m^T \\ &= \hat U \hat \Sigma {\hat V}^T \\ \end{align} \] where \(\hat U\) keeps only the first \(m\) columns of \(U\) and \(\hat \Sigma\) is the leading \(m \times m\) diagonal block of \(\Sigma\). This is usually called the economy SVD.
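
Here is a minimal sketch of the economy SVD in NumPy (again on random, purely illustrative data): passing full_matrices=False keeps only the first \(m\) columns of \(U\), and the same matrix can be rebuilt as a sum of rank-1 terms.

```python
import numpy as np

n, m = 100, 5                    # n >> m: many more measurements than snapshots
X = np.random.randn(n, m)

# Economy SVD: U_hat is n x m, s holds the m singular values, Vt is m x m.
U_hat, s, Vt = np.linalg.svd(X, full_matrices=False)

# Dyadic decomposition: the same X as a sum of rank-1 terms sigma_k u_k v_k^T.
X_rank1_sum = sum(s[k] * np.outer(U_hat[:, k], Vt[k, :]) for k in range(m))

print(np.allclose(X, U_hat @ np.diag(s) @ Vt))  # X = U_hat Sigma_hat V_hat^T
print(np.allclose(X, X_rank1_sum))              # identical, built term by term
```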

Recall that SVD decomposes \(X\) into two sets of orthogonal bases (\(u_1, u_2, \ldots, u_m\) and \(v_1, v_2, \ldots, v_m\)) and a diagonal matrix of singular values. The economy SVD above shows that \(X\) can equally be expressed as a sum of rank-1 matrices, each term \(\sigma_k u_k v_k^T\) having rank 1. This is called the dyadic decomposition, which writes a matrix of rank \(r\) as a sum of \(r\) rank-1 matrices.

Naturally, the more rank-1 matrices we sum up, the better the approximation of \(X\) becomes. Hence, if we truncate at rank \(r\), keeping only the first \(r\) columns of \(\hat U\), the leading \(r \times r\) block of \(\hat \Sigma\), and the first \(r\) rows of \(\hat V^T\), we obtain a rank-\(r\) matrix \[ X \approx \tilde U \tilde \Sigma \tilde V^T \] In fact, \(\tilde U \tilde \Sigma \tilde V^T\) is the best rank-\(r\) approximation of \(X\). This is the Eckart–Young theorem: \[ \underset{\tilde X \ \text{s.t.} \ \operatorname{rank}(\tilde X) = r}{\operatorname{argmin}} \| X - \tilde X \|_F = \tilde U \tilde \Sigma \tilde V^T \] where \(\| \cdot \|_F\) is the Frobenius norm for matrices (\(\|A\|_F = \sqrt{\sum_{i, j} A_{i,j}^2}\)). It guarantees that the best possible rank-\(r\) approximation of \(X\) is given by the rank-\(r\) truncated SVD. Note: since the truncated \(\tilde U\) and \(\tilde V\) are no longer square matrices, \(\tilde U \tilde U^T = I\) and \(\tilde V \tilde V^T = I\) no longer hold (although \(\tilde U^T \tilde U = \tilde V^T \tilde V = I_{r \times r}\) still do).
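
The following sketch (random data, with the truncation rank \(r = 5\) chosen arbitrarily) builds the rank-\(r\) truncation and illustrates the Eckart–Young result: the truncation error equals \(\sqrt{\sigma_{r+1}^2 + \ldots + \sigma_m^2}\), and any other rank-\(r\) matrix does at least as badly.

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(100, 20)
U, s, Vt = np.linalg.svd(X, full_matrices=False)

r = 5                                           # truncation rank (illustrative choice)
X_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]     # rank-r truncated SVD

err_svd = np.linalg.norm(X - X_r, 'fro')

# The minimal Frobenius error is the norm of the discarded singular values.
print(np.isclose(err_svd, np.sqrt(np.sum(s[r:] ** 2))))

# Any other rank-r matrix (here a random one) approximates X at least as poorly.
Y = np.random.randn(100, r) @ np.random.randn(r, 20)
print(err_svd <= np.linalg.norm(X - Y, 'fro'))
```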

SVD in Dominant Correlations

If we compute the correlation matrix \(X^T X\) (which is the covariance matrix of the columns of \(X\), up to mean-subtraction and scaling), we get \[ X^T X = V \hat \Sigma \hat U^T \hat U \hat \Sigma V^T = V \hat \Sigma^2 V^T \] which has the form of an eigen-decomposition, \(X^T X V = V \hat \Sigma^2\). Therefore the right singular vectors \(V\) are the eigenvectors of \(X^T X\), which measures correlations among the columns of \(X\), and the squared singular values \(\hat \Sigma^2\) are the corresponding eigenvalues.

If we do the same thing for \(X X^T\), we have \[ X X^T = \hat U \hat \Sigma V^T V \hat \Sigma \hat U^T = \hat U \hat \Sigma^2 \hat U^T \] Similarly, the left singular vectors \(\hat U\) are the eigenvectors of \(X X^T\), which measures correlations among the rows (the observations) of \(X\), and the eigenvalue matrix is the same \(\hat \Sigma^2\) as above.
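
A short numeric check of both statements on a random matrix: the right singular vectors diagonalize \(X^T X\) and the left singular vectors diagonalize \(X X^T\), with the squared singular values as eigenvalues.

```python
import numpy as np

X = np.random.randn(50, 6)
U_hat, s, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T

# X^T X V = V Sigma^2: V holds the eigenvectors of X^T X.
print(np.allclose(X.T @ X @ V, V @ np.diag(s ** 2)))

# X X^T U_hat = U_hat Sigma^2: U_hat holds the eigenvectors of X X^T.
print(np.allclose(X @ X.T @ U_hat, U_hat @ np.diag(s ** 2)))
```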

Something about \(\hat U\) and \(\hat V\)

Sometimes, when \(X\) represents a huge dataset, one issue is that \(\hat U\) is hard to compute this way because \(X X^T\) is an enormous \(n \times n\) matrix. But since \(X X^T\) and \(X^T X\) share the same nonzero eigenvalues \(\hat \Sigma^2\), we can instead diagonalize the much smaller \(X^T X\) to get \(V\) and \(\hat \Sigma\), and then recover \(\hat U = X V \hat \Sigma^{-1}\).
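
Here is a sketch of that shortcut on a random tall matrix: diagonalize the small \(m \times m\) matrix \(X^T X\) to get \(V\) and \(\hat \Sigma\), then recover \(\hat U = X V \hat \Sigma^{-1}\) without ever forming the \(n \times n\) matrix \(X X^T\).

```python
import numpy as np

n, m = 2000, 5
X = np.random.randn(n, m)

# Diagonalize the small m x m matrix X^T X instead of the huge n x n matrix X X^T.
eigvals, V = np.linalg.eigh(X.T @ X)
order = np.argsort(eigvals)[::-1]            # sort eigenvalues in descending order
eigvals, V = eigvals[order], V[:, order]

Sigma = np.sqrt(eigvals)                     # singular values of X
U_hat = X @ V / Sigma                        # U_hat = X V Sigma^{-1}, column by column

print(np.allclose(X, (U_hat * Sigma) @ V.T)) # reconstructs X = U_hat Sigma_hat V^T
```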

Unitary Transformations

Starting from \(X = \hat U \hat \Sigma V^T\), we have \(X V = \hat U \hat \Sigma\) and \(\hat U^T X = \hat \Sigma V^T\), which describe the unitary transformations between the two eigen-bases: multiplying \(X\) by the \(m\)-dimensional basis \(V\) maps it onto the \(n\)-dimensional basis \(\hat U\) (scaled by the singular values), and projecting \(X\) onto \(\hat U\) maps it back onto \(V\).
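
Both identities are easy to verify numerically on a small random matrix:

```python
import numpy as np

X = np.random.randn(8, 3)
U_hat, s, Vt = np.linalg.svd(X, full_matrices=False)
V, Sigma = Vt.T, np.diag(s)

print(np.allclose(X @ V, U_hat @ Sigma))     # X V = U_hat Sigma_hat
print(np.allclose(U_hat.T @ X, Sigma @ Vt))  # U_hat^T X = Sigma_hat V^T
```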

SVD in Solving Linear Systems of Equations

Below is a standard linear system \[ A x = b \] where \(A\) is a known \(n \times m\) matrix, \(b\) is a known \(n \times 1\) vector and \(x\) is an unknown \(m \times 1\) vector.

When \(n < m\), the linear system is underdetermined, as below \[ \begin{bmatrix} | & | & \cdots & | \\ A_1 & A_2 & \cdots & A_m \\ | & | & \cdots & | \\ \end{bmatrix} \begin{bmatrix} x_1\\ x_2 \\ \vdots \\ x_m\\ \end{bmatrix} = \begin{bmatrix} b_1\\ b_2 \\ \vdots \\ b_n\\ \end{bmatrix} \] and there are generally infinitely many solutions \(x\) for a given \(b\), because \(x\) has more degrees of freedom than the number of constraints imposed by \(b\).

When \(n > m\), the linear system is overdetermined, as below \[ \begin{bmatrix} | & | & \cdots & | \\ | & | & \cdots & | \\ A_1 & A_2 & \cdots & A_m \\ | & | & \cdots & | \\ | & | & \cdots & | \\ \end{bmatrix} \begin{bmatrix} x_1\\ \vdots \\ x_m\\ \end{bmatrix} = \begin{bmatrix} b_1\\ b_2 \\ b_3 \\ \vdots \\ b_n\\ \end{bmatrix} \] and there is generally no exact solution \(x\) for a given \(b\), because \(x\) does not have enough degrees of freedom to satisfy all of the constraints imposed by \(b\).

What SVD allows us to do is approximately invert \(A\): it yields the so-called pseudo-inverse matrix \(A^{\dagger}\), which gives the best \(x\) as the solution to the above linear system when \(n \neq m\): \[ x = A^{\dagger} b \] where \(A^{\dagger} = V \hat \Sigma^{-1} \hat U^T\) is the Moore–Penrose pseudo-inverse derived from \(A = \hat U \hat \Sigma V^T\). This pseudo-inverse \(A^{\dagger}\) has the following optimality properties under different circumstances (a code sketch follows the list):

  • When the linear system is underdetermined, \(x = A^{\dagger} b\) has the minimum 2-norm among all possible solutions, i.e. it solves \(\min \|x\|_2 \ \text{s.t.} \ Ax = b\).
  • When the linear system is overdetermined, \(x = A^{\dagger} b\) minimizes the residual \(\min \|Ax - b\|_2\), which is exactly ordinary least squares (OLS).
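
Here is a sketch of both cases, using NumPy's built-in pseudo-inverse (np.linalg.pinv, which is itself computed via the SVD); the systems are random and purely illustrative.

```python
import numpy as np

# Underdetermined (n < m): infinitely many solutions; pinv picks the minimum-norm one.
A_under = np.random.randn(3, 6)
b_under = np.random.randn(3)
x_min_norm = np.linalg.pinv(A_under) @ b_under
print(np.allclose(A_under @ x_min_norm, b_under))  # exact solution with smallest ||x||_2

# Overdetermined (n > m): generally no exact solution; pinv gives the least-squares fit.
A_over = np.random.randn(6, 3)
b_over = np.random.randn(6)
x_ls = np.linalg.pinv(A_over) @ b_over
x_lstsq, *_ = np.linalg.lstsq(A_over, b_over, rcond=None)
print(np.allclose(x_ls, x_lstsq))                  # matches ordinary least squares
```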

Least Squares Regression and SVD

Recall that if there exists a solution to the linear system \(Ax = b\), then \(b\) lies within the span of the columns of \(A\), since \(Ax\) is simply a linear combination of the column vectors of \(A\). Now let's look at the combination produced by the SVD solution: \[ A \tilde x = A A^{\dagger} b = \hat U \hat \Sigma V^T V \hat \Sigma^{-1} \hat U^T b = \hat U \hat U^T b \] Even though \(A \tilde x\) does not equal \(b\) (because the truncated \(\hat U\) makes \(\hat U \hat U^T \neq I\)), it is still the best least-squares solution: \(\hat U \hat U^T b\) is the orthogonal projection of \(b\) onto the span of the columns of \(\hat U\), which is also the span of the columns of \(A\).
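
A small numeric check of this projection property on a random overdetermined system: \(A A^{\dagger} b\) equals \(\hat U \hat U^T b\), the orthogonal projection of \(b\) onto the column space of \(A\).

```python
import numpy as np

A = np.random.randn(10, 3)
b = np.random.randn(10)

U_hat, s, Vt = np.linalg.svd(A, full_matrices=False)
x_tilde = Vt.T @ np.diag(1.0 / s) @ U_hat.T @ b     # x = V Sigma^{-1} U_hat^T b

# A x_tilde is the orthogonal projection of b onto the column space of A.
print(np.allclose(A @ x_tilde, U_hat @ (U_hat.T @ b)))
```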

Now consider a one-dimensional example \(ax = b\), where \(a\) and \(b\) are \(n \times 1\) vectors. If there is an exact solution to this equation, the points \((a_i, b_i)\) all lie on a single straight line through the origin, and \(x\) is the slope of that line. More often, though, the \((a, b)\) points do not lie on one line, which is where SVD becomes useful: it finds the best possible line to fit them.

(Figure: SVD_LS — the true line versus the SVD-fitted regression line through noisy data.)

As the above graph shows, the true line represents \(ax = b\), while in reality there are observation errors in measuring \(a\) and \(b\); SVD finds the regression/fitted line based on these noisy \(a\) and \(b\).
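
Here is a sketch of that one-dimensional fit; the true slope of 3, the noise level, and the sample size are all arbitrary illustrative choices. The SVD-based pseudo-inverse recovers the slope from the noisy \((a, b)\) pairs.

```python
import numpy as np

np.random.seed(1)
true_slope = 3.0                                    # hypothetical ground-truth slope
a = np.random.randn(200, 1)                         # n x 1 vector of measurements
b = true_slope * a + 0.5 * np.random.randn(200, 1)  # noisy observations of a * x

# Pseudo-inverse of the n x 1 matrix a: x = V Sigma^{-1} U_hat^T b.
U_hat, s, Vt = np.linalg.svd(a, full_matrices=False)
x_hat = Vt.T @ np.diag(1.0 / s) @ U_hat.T @ b

print(x_hat.item())                                 # fitted slope, close to 3.0
```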

Principal Component Analysis (PCA) and SVD

PCA is a projection method that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components (PCs).

Intuitively, PCA converts the observations into a set of PCs that independently represent the original data with different degrees of importance, so we can keep only some of the PCs to reduce the data's dimensionality while retaining as much of its information as possible.

(The search for PCs is sequential: at each step, PCA identifies the axis that accounts for the largest amount of variance once the residuals are projected onto it.)

Derivation

PCA's goal is to find orthogonal directions that represent the original data as well as possible, so variance is the natural measure of how much of the data a direction captures: if one direction carries the largest variance when the data are projected onto it, that direction represents the data best. Let's first define the sample covariance matrix \[ S = \frac{1}{n - 1} B^T B \] where \(B = X - \bar X\) is the demeaned data, \(\bar X\) is the matrix whose every row is the \(1 \times m\) vector of column means of \(X\) (the mean of each variable, treating the \(n\) rows of \(X\) as observations), and dividing by \(n - 1\) is the usual correction for the bias introduced by using the sample mean instead of the true population mean.

Now, since the projection of a vector \(x \in \mathbb R^m\) onto another vector/direction \(v_i\) is simply the dot product \(v_i^T x\), we can write the variance of the data projected onto a unit vector \(v_1\) as \[ \frac{1}{n - 1} v_1^T (X - \bar X)^T (X - \bar X) v_1 = v_1^T S v_1 \] Maximizing this projected variance, \(v_1 = \underset{\|v_1\|=1}{\operatorname{argmax}}\ v_1^T S v_1\), finds the direction of the first principal component (note: \(v_1\) itself is not a principal component). We then continue the process by projecting \(B\) onto directions \(v_2\) orthogonal to \(v_1\), then \(v_3\) orthogonal to both, and so on. In the end, the directions \(v_1, \ldots, v_k\) turn out to be the first \(k\) eigenvectors of \(S\), corresponding to its \(k\) largest eigenvalues. This is the eigen-decomposition \[ S V = V D \] where \(V\) is the eigenvector matrix of the covariance matrix \(S\) and \(D\) is the eigenvalue matrix. And if we apply SVD to the data matrix \(B\), we see that the eigenvector matrix \(V\) of the covariance matrix \(S\) equals the right singular vector matrix \(V\), and the singular values are proportional to the square roots of the eigenvalues of \(S\): \[ B = U \Sigma V^T \\ S = \frac{1}{n - 1} B^T B = \frac{1}{n - 1} V \Sigma^T \Sigma V^T \\ S V = V \frac{\Sigma^T \Sigma}{n - 1} \] So we now have several ways to compute the principal components \[ T = B V = U \Sigma \] where \(T\) is the matrix of principal components, and \(V\) can be obtained either from the SVD of \(B\) or from the eigen-decomposition of \(S\).
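
Finally, a minimal sketch on synthetic correlated data (rows are observations, columns are variables; the sizes and the random mixing matrix are arbitrary) showing that the principal components \(T = BV = U\Sigma\) from the SVD of the demeaned data match those from the eigen-decomposition of the covariance matrix \(S\).

```python
import numpy as np

np.random.seed(2)
n, m = 500, 3
X = np.random.randn(n, m) @ np.random.randn(m, m)  # correlated synthetic variables

B = X - X.mean(axis=0)                 # demean each variable (column-wise means)
S = B.T @ B / (n - 1)                  # sample covariance matrix

# Route 1: SVD of the demeaned data.
U, s, Vt = np.linalg.svd(B, full_matrices=False)
T_svd = U @ np.diag(s)                 # principal components T = U Sigma = B V

# Route 2: eigen-decomposition of the covariance matrix.
eigvals, V_eig = np.linalg.eigh(S)
order = np.argsort(eigvals)[::-1]
eigvals, V_eig = eigvals[order], V_eig[:, order]

print(np.allclose(s ** 2 / (n - 1), eigvals))         # eigenvalues = sigma^2 / (n - 1)
print(np.allclose(np.abs(Vt.T), np.abs(V_eig)))       # same directions, up to sign
print(np.allclose(np.abs(T_svd), np.abs(B @ V_eig)))  # same PCs, up to sign
```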