0. A Quick Question
Let's first look at a question:
Here is a portfolio risk report with 2 underlying yield sensitivities (DV01-style):
- 10-year treasury yield sensitivity: -100k / bp
- 30-year treasury yield sensitivity: +100k / bp
How do we hedge this portfolio against these 2 yield risks?
The first idea that comes to mind is to trade 10-year against 30-year treasury bonds, with a hedge ratio derived from a regression on historical data. This kind of static, regression-based hedge breaks easily when the yield curve reshapes. This is where CCA (and PCA) come in: they give hedges of these risk exposures that cope better with yield curve reshaping.
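To make the naive approach concrete, here is a minimal sketch (in Python) of a regression-style hedge ratio. The yield-change series `dy10` and `dy30` are simulated stand-ins for historical data; only the DV01 figures come from the risk report above, everything else is made up for illustration.

```python
# Sketch of the regression-based static hedge described above (not the CCA hedge).
# dy10 / dy30 are hypothetical daily yield changes (bp); only the DV01s are from the report.
import numpy as np

rng = np.random.default_rng(0)
dy30 = rng.normal(0.0, 5.0, 500)                  # stand-in for 30y daily yield changes (bp)
dy10 = 0.8 * dy30 + rng.normal(0.0, 2.0, 500)     # stand-in for correlated 10y changes (bp)

beta = np.polyfit(dy30, dy10, 1)[0]               # regression hedge ratio: bp of 10y move per bp of 30y move

dv01_10, dv01_30 = -100_000.0, 100_000.0          # key-rate DV01s from the report ($ per bp)

# If the 30y yield moves 1 bp and the 10y follows by roughly beta bp, the portfolio P&L is
# about dv01_30 + beta * dv01_10; the static hedge adds an offsetting amount of 30y DV01.
hedge_dv01_30 = -(dv01_30 + beta * dv01_10)
print(f"beta = {beta:.2f}, 30y DV01 to add as a hedge = {hedge_dv01_30:,.0f} $/bp")
```

The weakness is exactly the one mentioned above: `beta` is a static estimate, so the hedge drifts off as soon as the 10y/30y relationship (the curve shape) changes.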
1. An Introduction to CCA
Say there is a pair of CMT rates (\(Y_t^1\), \(Y_t^2\)). What CCA tries to do is find a cointegration vector \([1, -\gamma]^T\) such that the following equation holds
\[ Y_t^1 - \gamma \cdot Y_t^2 = \mu + \epsilon_t \] where \(\epsilon_t\) is a stationary white-noise process, so the combination \(Y_t^1 - \gamma \cdot Y_t^2\) is mean-reverting around \(\mu\).
This looks like an OLS regression, and the assumptions about the error term are much the same. The difference lies in the optimization objective: CCA looks for the vector that yields the most mean-reverting combination, judged by a self-predictability measure, while OLS looks for the vector that minimizes the prediction error. We will say more about this difference later.
What if we extend the pair to a portfolio of \(n\) CMT rates \(Y_t = [Y_t^1, Y_t^2, ..., Y_t^n]^T\)? The problem is still to find a cointegration vector \(\gamma = [\gamma_1, \gamma_2, ..., \gamma_n]^T\) such that the following equation holds
\[ \gamma^T Y_t = \mu + \epsilon_t \] In theory, we can find a cointegration vector that builds a mean-reverting combination out of as many assets as we want. In practice, though, it is hard to execute all the legs at decent prices at the same time, and the execution costs pile up quickly.
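As an illustration of what this equation means in practice, here is a small sketch that builds the aggregate \(\gamma^T Y_t\) for a candidate \(\gamma\) on simulated cointegrated series and checks its stationarity with an ADF test. The data and the weights are made up for the example; in a real application \(\gamma\) would come from the CCA procedure derived below.

```python
# Sketch: form gamma' Y_t for a candidate cointegration vector and check mean reversion.
# Y and gamma are hypothetical; adfuller is the standard Augmented Dickey-Fuller test.
import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(1)
T = 750
common = np.cumsum(rng.normal(0.0, 0.02, T))          # shared random-walk "level" factor
Y = np.outer(common, [1.0, 1.1, 1.2]) + rng.normal(0.0, 0.01, (T, 3))   # 3 cointegrated "rates"

# Butterfly-style weights chosen here to cancel the common trend exactly (1 - 2*1.1 + 1.2 = 0);
# CCA would find such weights from the data instead of by construction.
gamma = np.array([1.0, -2.0, 1.0])
spread = Y @ gamma                                    # the aggregate gamma' Y_t

pvalue = adfuller(spread)[1]                          # small p-value -> reject unit root -> mean-reverting
print(f"ADF p-value of the aggregate gamma^T Y_t: {pvalue:.4f}")
```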
So how do we find the cointegration vector \(\gamma\)? It is the vector produced by CCA that makes \(\gamma^T Y_t\) a canonical variable, specifically the least predictable one.
2. Derivation of CCA
Let's look at how CCA is rigorously derived. There are two similar ways of performing CCA: one introduced in Box and Tiao (1977) and another in Chou and Ng (1994). Here we consider the former paper.
2.1 Self-predictability Measure
Consider a \(1 \times k\) vector process \(\{\mathbb{Z}_t\}\) and let \(z_t = \mathbb{Z}_t - \mu\), where \(\mu\) is a convenient \(1 \times k\) vector of origin which is the mean if the process is stationary. Suppose \(z_t\) follows the \(p\)-th order multiple autoregressive model
\[z_t = \hat z_{t-1}(1) + a_t\]
where
\[\hat z_{t-1}(1) = E(z_t|z_{t-1}, z_{t-2},..) = \sum_{l=1}^{p}z_{t-l} \pi_l\]
is the expectation of \(z_t\) conditional on past history up to time \(t-1\), the \(\pi_l\) are \(k \times k\) matrices, \(\{ a_t\}\) is a sequence of independently and normally distributed \(1 \times k\) vector random shocks with mean zero and covariance matrix \(\Sigma\), and \(a_t\) is independent of \(\hat z_{t-1}(1)\) - like the assumptions in OLS. And the \(AR(p)\) model can be then represented as
\[z_t (I - \sum_{l=1}^{p}\pi_l B^l) = a_t\]
where \(I\) is the identity matrix and \(B\) is the backshift operator such that \(B z_t = z_{t-1}\).
The process \(\{z_t\}\) is stationary if the determinantal polynomial in \(B\), \(\det(I - \sum_{l=1}^{p}\pi_l B^l)\), has all its zeros lying outside the unit circle; otherwise the process is called non-stationary.
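For example, in the scalar \(AR(1)\) case \(z_t = z_{t-1}\pi + a_t\), the determinantal polynomial is \(1 - \pi B\), whose single zero \(B = 1/\pi\) lies outside the unit circle exactly when \(|\pi| < 1\), which is the familiar AR(1) stationarity condition. The boundary case \(\pi = 1\) (a zero on the unit circle) is a random walk, the kind of non-stationary behavior the yield levels themselves exhibit.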
Now, let's make the problem simpler by setting \(k=1\), i.e., narrowing down to a single time series. Then, if the process is stationary, since \(a_t\) is independent of \(\hat z_{t-1}(1)\), we have
\[E(z_t^2) = E(\{\hat z_{t-1}(1)\}^2) + E(a_t^2)\]
which can be also written as
\[ \sigma_z^2 = \sigma_{\hat z}^2 + \sigma_a^2\]
We can then define a quantity \(\lambda\) to measure the predictability of a stationary series from its past as \(\lambda = \frac{\sigma_{\hat z}^2}{\sigma_z^2} = 1 - \frac{\sigma_a^2}{\sigma_z^2}\).
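As a concrete check, for a scalar \(AR(1)\) process \(z_t = \phi z_{t-1} + a_t\) we have \(\hat z_{t-1}(1) = \phi z_{t-1}\), so \(\sigma_{\hat z}^2 = \phi^2 \sigma_z^2\) and
\[ \lambda = \frac{\phi^2 \sigma_z^2}{\sigma_z^2} = \phi^2 \]
A very persistent series (\(\phi\) close to \(1\)) has \(\lambda\) close to \(1\) and is highly predictable (momentum-like), while white noise (\(\phi = 0\)) has \(\lambda = 0\).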
Note: the derivation above applies to a single time series, and \(z_t\) is assumed to be stationary here.
2.2 Intuition of CCA Decomposition
Now let's consider \(k\) processes \(z_t\) representing \(k\) different stock market indexes, such as the Dow Jones Average, the Standard & Poor's index, the Russell index, etc., all of which exhibit dynamic growth.
It is natural to conjecture that each might be represented as some aggregate of one or more common inputs which may be nearly nonstationary (momentum), together with other stationary (mean-reverting) or white noise components.
In other words, this leads us to contemplate linear aggregates of the form \(u_t = z_t m\), where \(m\) is a \(k \times 1\) vector chosen so that \(u_t\) becomes a momentum or a mean-reverting time series.
Note: \(z_t\) is a vector consisting of \(k\) time series, and different choices of \(m\) yield different \(u_t\). These different series \(u_t\) (whether mean-reverting or momentum) are the aggregates, i.e., derived time series built from \(z_t\). The process of getting \(u_t\) from \(z_t\) is called the CCA decomposition, even though the two processes are not on the same level (\(u_t\) is a transformed, or derived, process).
(Can \(z_t\) be represented as a linear combination of \(u_t\)? Yes, provided we take \(k\) linearly independent vectors \(m\): stacking the corresponding aggregates into \(y_t = z_t M\), with \(M\) invertible, gives \(z_t = y_t M^{-1}\). This is exactly the canonical transformation described below.)
The aggregates \(u_t\) which depend most heavily on the past, namely those with large \(\lambda\) (here \(\lambda\) refers to the \(\lambda\) of \(u_t\)), may serve as useful composite indicators of the overall growth of the stock market (momentum). By contrast, the aggregates with \(\lambda\) nearly zero may reflect stable contemporaneous relationships (mean-reverting) among the original indicators.
The analysis in Box and Tiao's paper yields \(k\) 'canonical' components \(u_t\), ordered from least to most predictable. Thus we may usefully decompose the k-dimensional space of the observations \(z_t\) into stationary and non-stationary subspaces.
2.3 Derive Canonical Variables
Let \(\Gamma_j(z) = E(z_t^T z_{t-j})\) be the lag-\(j\) autocovariance matrix of \(z_t\). Writing out the variance (lag-0) equation of the AR model, and using the independence of \(a_t\) from the past, we have
\[ \Gamma_0(z) = \sum_{l=1}^{p} \Gamma_l(z)\pi_l + \Sigma= \Gamma_0(\hat z) + \Sigma \]
where \(\Gamma_0(\hat z)\) is the covariance matrix of \(\hat z_{t-1}(1)\). Until further notice, we shall assume that \(\Sigma\), and therefore \(\Gamma_0(z)\), are positive-definite.
Now, consider the linear combination \(u_t = z_t m\). For \(u_t\), we have that \(u_t = \hat u_{t-1}(1) + v_t\), where \(\hat u_{t-1}(1) = \hat z_{t-1}(1) m\) and \(v_t=a_t m\). The predictability of \(u_t\) from its past is therefore measured by
\[ \lambda = \frac{\sigma_{\hat u}^2}{\sigma_{u}^{2}} = \frac{m^T \Gamma_{0}(\hat z)\, m}{m^T \Gamma_{0}(z)\, m}\]
Maximizing (or minimizing) this ratio over \(m\) leads to the generalized eigenvalue problem \(\Gamma_{0}(\hat z)\, m = \lambda\, \Gamma_{0}(z)\, m\). Collecting the \(k\) eigenvectors as the columns of a matrix \(M\), this can be represented in matrix form as
\[ \Lambda = M^{-1} \Gamma_{0}^{-1}(z) \Gamma_{0}(\hat z) M\]
Note: \(M\) is what we are looking for. The logic is that we want a transformed process \(\{u_t\}\) generated from the original \(z_t\) by \(M\), and we have derived that \(M\) can be found via the eigendecomposition of \(\Gamma_{0}^{-1}(z) \Gamma_{0}(\hat z)\).
This is what we call an eigendecomposition, and we can therefore conclude that, for maximum predictability, \(\lambda\) must be the largest eigenvalue of \(\Gamma_{0}^{-1}(z) \Gamma_{0}(\hat z)\), with \(m\) the corresponding eigenvector; this choice makes \(u_t\) a momentum time series. Similarly, the eigenvector corresponding to the smallest eigenvalue yields the least predictable combination of \(z_t\). This vector is what we refer to as the cointegration vector, the one used in the risk-hedging question at the very beginning.
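To make this concrete, here is a minimal numerical sketch of the procedure under a VAR(1) assumption: fit a one-lag VAR by least squares to get \(\hat z_{t-1}(1)\), form the two covariance matrices, and solve the generalized eigenproblem. The function name `box_tiao` and the toy data are ours, not from the paper.

```python
# Minimal sketch of the Box-Tiao eigenproblem under a VAR(1) assumption.
# z is a T x k array of observations (rows indexed by time); all names here are ours.
import numpy as np
from scipy.linalg import eigh

def box_tiao(z):
    """Return predictabilities lambda (ascending) and eigenvectors m as columns of M."""
    z = z - z.mean(axis=0)
    x, y = z[:-1], z[1:]                        # lagged / current observations
    pi = np.linalg.lstsq(x, y, rcond=None)[0]   # VAR(1) coefficients: y ~ x @ pi
    z_hat = x @ pi                              # one-step predictions \hat z_{t-1}(1)
    gamma0_z = np.cov(y, rowvar=False)          # Gamma_0(z)
    gamma0_zhat = np.cov(z_hat, rowvar=False)   # Gamma_0(\hat z)
    # Generalized symmetric eigenproblem Gamma_0(zhat) m = lambda Gamma_0(z) m,
    # i.e. the eigendecomposition of Gamma_0(z)^{-1} Gamma_0(zhat).
    lam, M = eigh(gamma0_zhat, gamma0_z)        # lam ascending: least predictable first
    return lam, M

# Toy check: three noisy copies of one random walk; the least predictable
# combination should be a (nearly) stationary spread with a small lambda.
rng = np.random.default_rng(0)
level = np.cumsum(rng.normal(0.0, 1.0, 1000))
z = np.column_stack([level + rng.normal(0.0, 0.3, 1000) for _ in range(3)])
lam, M = box_tiao(z)
print("lambdas:", np.round(lam, 3))
print("least predictable weights m_1:", np.round(M[:, 0], 3))
```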
Canonical Transformation
Let \(\lambda_1, ..., \lambda_k\) be the \(k\) real eigenvalues of the matrix \(\Gamma_{0}^{-1}(z) \Gamma_{0}(\hat z)\). Suppose the \(\lambda_j\) are ordered with \(\lambda_1\) the smallest, and that the \(k\) corresponding linearly independent eigenvectors \(m_1, .., m_k\) form the \(k\) columns of a matrix \(M\). Then we can construct a transformed process \(\{ y_t\}\), where
\[ y_t = \hat y_{t-1} (1) + b_t \]
with
\[ y_t = z_t M, b_t = a_t M, \hat y_{t-1}(1)=\sum_{l=1}^{p} y_{t-l}\pi^1_l \] where \(\pi^1_l=M^{-1} \pi_l M\)
We now also have
\[ \Gamma_0(y) = \Gamma_0(\hat y) + \Sigma^1 \]
where \(\Gamma_0(y)=M^T \Gamma_0(z)M,\ \Gamma_0(\hat y)=M^T \Gamma_0(\hat z)M,\ \Sigma^1=M^T \Sigma M\)
Note:
- \(M^{-1} \Gamma_{0}^{-1}(z) \Gamma_{0}(\hat z) M = \Lambda\) and \(M^{-1} \Gamma_{0}^{-1}(z) \Sigma M = I - \Lambda\), where \(\Lambda\) is the \(k \times k\) diagonal matrix with elements \((\lambda_1, .., \lambda_k)\)
- \(0 \leq \lambda_j < 1 \space (j=1,..,k)\)
- for \(i \neq j\), \(m_i^T \Gamma_0(z) m_j = m_i^T \Sigma m_j = 0\). This makes \(\Gamma_0(y), \Gamma_0(\hat y), \Sigma^1\) all diagonal. (It can be proved from the eigen-relation \(\Gamma_{0}(\hat z) m_j = \lambda_{j} \Gamma_{0}(z) m_j\): pre-multiplying by \(m_i^T\), and doing the same with \(i\) and \(j\) swapped, gives \((\lambda_i - \lambda_j)\, m_i^T \Gamma_{0}(z) m_j = 0\), so the cross terms vanish whenever the eigenvalues differ.)
- the eigenvectors in \(M\) do not form an orthonormal basis, because \(\Gamma_{0}^{-1}(z) \Gamma_{0}(\hat z)\) is not symmetric
With this diagonal property, we can conclude that the transformation has produced \(k\) new component series \(\{ y_{1t}, y_{2t}, .., y_{kt}\}\) which
- are ordered from least predictable to most predictable (in the self-predictability sense)
- are contemporaneously independent
- have predictable components \(\{\hat y_{1(t-1)}(1), \hat y_{2(t-1)}(1), .., \hat y_{k(t-1)}(1)\}\) which are also contemporaneously independent
- have shocks \(\{ b_{1t}, b_{2t}, .., b_{kt}\}\) which are likewise contemporaneously independent
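Continuing the `box_tiao` sketch above, here is a quick numerical check of these properties on the toy data (so "diagonal" holds only up to sampling noise). It reuses `z`, `M` and NumPy from the earlier block.

```python
# Check: Gamma_0(y) and the VAR(1) residual covariance Sigma^1 of y should be near-diagonal.
y = (z - z.mean(axis=0)) @ M                                       # canonical components y_t = z_t M
resid = y[1:] - y[:-1] @ np.linalg.lstsq(y[:-1], y[1:], rcond=None)[0]   # VAR(1) shocks b_t (estimated)

def max_offdiag(c):
    return np.abs(c - np.diag(np.diag(c))).max()

print("max off-diagonal of Gamma_0(y):", round(max_offdiag(np.cov(y, rowvar=False)), 4))
print("max off-diagonal of Sigma^1:   ", round(max_offdiag(np.cov(resid, rowvar=False)), 4))
```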
Note: the discussion above applies to general \(AR(p)\) time series, while the example below works with \(AR(1)\) fits. (Also, \(M\) above can be computed in another way.)
Example
Say we have Constant Maturity Treasury rates (CMT rates) data \(\{z_t\}\) from \(02/01/2012\) to \(06/30/2015\), a part of which is given below.
Date | 6 Mo | 1 Yr | 2 Yr | 3 Yr | 5 Yr | ... | 30 Yr |
---|---|---|---|---|---|---|---|
2015-06-30 | 0.11 | 0.28 | 0.64 | 1.01 | 1.63 | ... | 3.11 |
2015-06-29 | 0.11 | 0.27 | 0.64 | 1.00 | 1.62 | ... | 3.09 |
2015-06-26 | 0.08 | 0.29 | 0.72 | 1.09 | 1.75 | ... | 3.25 |
2015-06-25 | 0.07 | 0.29 | 0.68 | 1.06 | 1.70 | ... | 3.16 |
.... | .... | .... | .... | .... | .... | ... | .... |
2012-02-01 | 0.09 | 0.13 | 0.23 | 0.31 | 0.72 | ... | 3.01 |
Over this period, the CMT rate series \(\{z_t\}\) are all momentum time series. When we fit an \(AR(1)\) to the series of each maturity separately, the \(AR(1)\) decay parameters come out around \(0.95\) to \(0.99\) (a sample of several years is not a short horizon, and the different maturities are contemporaneously driven by the same influences, so over the long run they all show similar trends).
But after we apply the canonical transformation to construct the new time series \(\{y_t\}\) (just as discussed above), the most mean-reverting of them has an \(AR(1)\) decay parameter of about \(0.51\).
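We do not reproduce the exact data set here, but the comparison can be sketched as follows, assuming `cmt` is a hypothetical \(T \times k\) NumPy array holding the CMT panel (rows = dates, oldest first) and reusing the `box_tiao` helper from the earlier sketch; the resulting numbers will of course differ from the ones quoted above.

```python
# Sketch of the AR(1) decay-parameter comparison. `cmt` is a hypothetical T x k array
# of CMT rates; box_tiao is the helper defined in the earlier sketch.
import numpy as np

def ar1_phi(x):
    """AR(1) decay parameter of a series, via least squares of x_t on x_{t-1} (demeaned)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    return np.polyfit(x[:-1], x[1:], 1)[0]

print("AR(1) phi per maturity:           ", [round(ar1_phi(cmt[:, j]), 2) for j in range(cmt.shape[1])])

lam_cmt, M_cmt = box_tiao(cmt)                    # canonical transformation (earlier sketch)
y_cmt = (cmt - cmt.mean(axis=0)) @ M_cmt          # canonical components, least predictable first
print("AR(1) phi per canonical component:", [round(ar1_phi(y_cmt[:, j]), 2) for j in range(y_cmt.shape[1])])
```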

Note that the constructed series \(\{y_t\}\) do not correspond one-to-one to the original CMT rate series \(\{z_t\}\). This is similar to PCA: the first principal component does not correspond to the first column of the original panel data.
Applications
- Spot small mean-reverting portfolios.
- Do CCA reconstruction to generate detrended data (a sketch follows below).
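One way such a reconstruction could look, reusing the hypothetical `cmt` panel and the canonical matrix `M_cmt` from the previous sketch: keep only the least predictable (mean-reverting) components and map back to the original coordinates with \(M^{-1}\). How many components to keep (`n_keep`) is a free choice.

```python
# Sketch of a CCA-style reconstruction for detrending, reusing `cmt` and `M_cmt` above.
import numpy as np

n_keep = 2                                      # number of mean-reverting components to keep
mu = cmt.mean(axis=0)
y_cmt = (cmt - mu) @ M_cmt                      # canonical components, least predictable first

y_detrended = np.zeros_like(y_cmt)
y_detrended[:, :n_keep] = y_cmt[:, :n_keep]     # drop the momentum (most predictable) components

cmt_detrended = y_detrended @ np.linalg.inv(M_cmt) + mu   # back to the original rate coordinates
```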