Announcements
The contents below are basically my notes on the Essence of Linear Algebra series from 3Blue1Brown.
What exactly is a vector?
There are mainly 3 perspectives as below
- Physics student - Vectors are arrows with specific directions and lengths.
- CS student - Vectors are ordered lists of numbers.
- Mathematician - Vectors generalize both views: anything that can be sensibly added together and multiplied by numbers.
If there is a coordinate system, say the \(x-y\) plane, we can easily imagine an arrow with its tail sitting at the origin. No matter which direction it points in or how long it is, it is a 2-dimensional vector which can be easily visualized. At the same time, the coordinates of this vector are a pair of numbers that tell you how to get from the tail of the vector, at the origin, to its tip. For example, the first number of the vector \(\vec{v} = [2 \space 3]^T\) tells you how far to walk along the \(x\)-axis and the second number tells you how far to walk along the \(y\)-axis after that. The same goes for n-dimensional space.
Vector Operations
With the coordinate definition, it is straightforward to define vector addition. Imagine 2 vectors \(\vec{u}, \vec{v}\) and move the second one so that its tail sits at the tip of the first one. Then draw a new vector from the tail of the first one to the tip of the second one; that new vector is the sum of the 2 vectors. \[ \vec{u} + \vec{v} = \begin{bmatrix} u_{1} \\ u_{2} \end{bmatrix} + \begin{bmatrix} v_{1} \\ v_{2} \end{bmatrix} = \begin{bmatrix} u_{1} + v_{1} \\ u_{2} + v_{2} \end{bmatrix} \] The other vector operation is multiplication by a number. \(2\vec{v}\) simply means stretching the original vector so that it's twice the original length, and \(\frac{1}{3}\vec{v}\) means squishing \(\vec{v}\) to \(\frac{1}{3}\) of the original length. This process is called scaling, and the numbers used to scale vectors are called scalars. \[ a \vec{v} = a \begin{bmatrix} v_{1} \\ v_{2} \end{bmatrix} = \begin{bmatrix} a v_{1} \\ av_{2} \end{bmatrix} \]
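As a quick numerical sketch (NumPy is my own addition here, not part of the original notes), both operations act entry by entry:

```python
import numpy as np

u = np.array([1.0, 2.0])
v = np.array([2.0, 3.0])

print(u + v)      # [3. 5.]  -> entrywise addition
print(2 * v)      # [4. 6.]  -> scaling stretches v to twice its length
print(v / 3)      # [0.66666667 1.]  -> scaling by 1/3 squishes it
```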
Linear combinations, span, and basis vectors
Let us look at a vector \(\vec{v} = \begin{bmatrix} 2 \\ 3 \end{bmatrix}\). If we use the above vector operations to express this vector using the 2 special vectors \(\vec{i} = \begin{bmatrix} 1 \\ 0 \end{bmatrix}\) and \(\vec{j} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}\), which are the unit vectors in the x-direction and y-direction (they are also the typical basis vectors of the x-y coordinate system), we get the linear combination \(\vec{v} = 2 \vec{i} + 3\vec{j}\). It is natural to think of \(\vec{v}\) as adding \(\vec{i}\) scaled by 2 and \(\vec{j}\) scaled by 3. This "adding scaled vectors" process uses linear combinations of basis vectors to express any 2-D vector, and it is a key idea for the concepts discussed below.
Span
The span of \(\vec{u}\) and \(\vec{v}\) is the set of all their linear combinations - \(a \vec{u} + b \vec{v}\) with \(a\) and \(b\) varying over all real numbers. In other words, the span of these 2 vectors describes all the possible vectors you can reach using these 2 vectors and the 2 fundamental operations - vector addition and scalar multiplication. If \(\vec{u}\) and \(\vec{v}\) line up, their span is just a line. If \(\vec{u}\) and \(\vec{v}\) are both zero vectors, their span is just a point. In most cases, their span is the entire infinite sheet of 2-D space.
From this perspective, linearly dependent vectors arise when one vector can be removed from the set without reducing the span; such a vector can be expressed as a linear combination of the others because it is already in their span. Conversely, if each vector adds another dimension to the span, the vectors are linearly independent. A quick numerical check of this is sketched below.
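A small NumPy sketch (my own addition): stack the vectors as columns and compare the rank of the resulting matrix with the number of vectors.

```python
import numpy as np

def independent(*vectors):
    """Vectors are linearly independent iff the rank equals their count."""
    M = np.column_stack(vectors)
    return np.linalg.matrix_rank(M) == len(vectors)

u = np.array([1.0, 2.0])
v = np.array([2.0, 4.0])   # v = 2u, so u and v line up
w = np.array([0.0, 1.0])

print(independent(u, v))   # False -> their span is just a line
print(independent(u, w))   # True  -> their span is the whole 2-D plane
```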
Basis
The basis of a vector space is a set of linearly independent vectors that span the full space.
Matrices and Linear Transformations
Transformation is a fancy word for function. In the context of linear algebra, we like to think about transformations that take in some input vector and spit out another vector. The word transformation suggests visualizing how an input vector moves to become the output vector; it may be spun, stretched or flipped.
A transformation is linear if 1) all lines remain lines without getting curved, and 2) the origin remains fixed in place. In general, a linear transformation can be seen as one that keeps grid lines parallel and evenly spaced. A straightforward example is a rotation about the origin.
Now the question becomes: how should we describe a linear transformation numerically? The answer is super simple - we record where the basis vectors land and pack those coordinates into a matrix. The intuition is that every vector is a certain linear combination of the basis vectors, and because linear transformations keep grid lines parallel and evenly spaced, a vector that starts off as a certain linear combination of the basis vectors ends up as the same linear combination of the transformed basis vectors. Mathematically, we can express the process \[ i = \begin{bmatrix} 1 \\ 0 \end{bmatrix} \rightarrow \begin{bmatrix} 1 \\ -2 \end{bmatrix}, \space j = \begin{bmatrix} 0 \\ 1 \end{bmatrix} \rightarrow \begin{bmatrix} 3 \\ 0 \end{bmatrix} \\ \] as the transformation of the basis vectors, and the process \[ \begin{bmatrix} x \\ y \end{bmatrix} = x\begin{bmatrix} 1 \\ 0 \end{bmatrix} + y \begin{bmatrix} 0 \\ 1 \end{bmatrix} = \begin{bmatrix} 1 & 0\\ 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} \rightarrow x\begin{bmatrix} 1 \\ -2 \end{bmatrix} + y \begin{bmatrix} 3 \\ 0 \end{bmatrix} = \begin{bmatrix} 1 & 3\\ -2 & 0 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} \] as the linear transformation of an arbitrary vector. This is exactly the "adding scaled vectors" idea stated above in the Linear Combinations chapter. If we omit the vector \(\begin{bmatrix} x \\ y \end{bmatrix}\) to simplify the process, we get \[ \begin{bmatrix} 1 & 0\\ 0 & 1 \end{bmatrix} \rightarrow \begin{bmatrix} 1 & 3\\ -2 & 0 \end{bmatrix} \] As we can see, each column of the new matrix tells us where the original basis vectors \(i\) and \(j\) land after the transformation. The identity matrix of basis vectors becomes a new matrix representing a certain linear transformation, and every vector in the original span follows along.
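A quick NumPy sanity check (my addition): multiplying the matrix by a vector is the same as taking that linear combination of the transformed basis vectors.

```python
import numpy as np

# columns are where i-hat and j-hat land after the transformation
M = np.array([[ 1.0, 3.0],
              [-2.0, 0.0]])
x, y = 2.0, 3.0

by_columns = x * M[:, 0] + y * M[:, 1]     # "adding scaled (transformed) basis vectors"
by_matmul  = M @ np.array([x, y])          # matrix-vector multiplication

print(by_columns, by_matmul)               # both give [11. -4.]
print(np.allclose(by_columns, by_matmul))  # True
```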
If I am given a matrix \(\begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix}\), I will say it indicates a \(90^\circ\) clockwise rotation in 2-D space (its columns are where \(i\) and \(j\) land). Correspondingly, \(\begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix}\) means rotating the vector \(\begin{bmatrix} x \\ y \end{bmatrix}\) by \(90^\circ\) clockwise about the origin. The important point of introducing linear transformations is learning to see any matrix as a certain linear transformation, because this makes it easier to understand concepts like matrix multiplication, determinants, eigenvectors and others.
Matrix Multiplication as Composition
Now consider 3 matrices \[ R = \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix}, \space S = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix}, \space SR = \begin{bmatrix} 1 & -1 \\ 1 & 0 \end{bmatrix} \] where \(R\) represents a \(90^\circ\) counterclockwise rotation, \(S\) represents a shear, and \(SR\) represents a rotation followed by a shear. Note that \(SR\) is read right to left: it describes the overall effect of first rotating and then shearing. It is equivalent to carrying out the 2 successive actions on a vector like below. \[ \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix} \left( \begin{bmatrix} 0 & -1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} \right) = \begin{bmatrix} 1 & -1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} \] where the left hand side shows first rotating and then shearing the vector \(\begin{bmatrix} x \\ y \end{bmatrix}\), and the right hand side shows the composite transformation doing the same job in one step. This is the geometric meaning of matrix multiplication: applying one transformation and then another.
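A NumPy sketch (my addition) verifying that multiplying the matrices first gives the same result as applying the transformations one after the other:

```python
import numpy as np

R = np.array([[0.0, -1.0],
              [1.0,  0.0]])   # 90 degree counterclockwise rotation
S = np.array([[1.0, 1.0],
              [0.0, 1.0]])    # shear
x = np.array([2.0, 3.0])

step_by_step = S @ (R @ x)    # rotate first, then shear
composite    = (S @ R) @ x    # the single composite matrix does the same job

print(S @ R)                  # [[ 1. -1.] [ 1.  0.]]
print(np.allclose(step_by_step, composite))   # True
```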
Now let's take a look at how matrix multiplication is done mathematically \[ \begin{bmatrix} e & f \\ g & h \end{bmatrix} \begin{bmatrix} a & b \\ c & d \end{bmatrix} = \begin{bmatrix} a \begin{bmatrix} e \\ g \end{bmatrix} + c \begin{bmatrix} f \\ h \end{bmatrix} & b \begin{bmatrix} e \\ g \end{bmatrix} + d \begin{bmatrix} f \\ h \end{bmatrix} \end{bmatrix} \] We can read this as transforming the columns of the right matrix by the left matrix: the first column of the product is where the vector \(\begin{bmatrix} a \\ c \end{bmatrix}\) lands under the left transformation, and the second column follows the same procedure.
Here are some matrix multiplication properties which are easy to see with this way of thinking.
Associativity
\((AB)C = A(BC)\) can be seen as first applying the transformation represented by \(C\) and then \(AB\), or as first applying the composite transformation represented by \(BC\) and then \(A\). Either way, the overall effect is the same composite transformation \(ABC\).
Commutativity
\(AB \neq BA\) in general, which can be seen by taking \(A\) to be a rotation and \(B\) a shear. \(AB\) is a shear-then-rotate transformation that brings the basis vectors close together, while \(BA\) is a rotate-then-shear transformation that leaves the basis vectors pointing far apart.
Determinant
We have seen that any matrix represents a certain linear transformation, and we describe the scaling factor of that transformation using the determinant. If the determinant of a transformation is \(3\), then the transformation increases the area of any region by a factor of \(3\). Let's look at the matrix below \[ \begin{bmatrix} 1 & 0 \\ 0 & 2 \end{bmatrix} \] whose linear transformation stretches all vectors in the \(y\) direction by a factor of \(2\). This turns out to increase all areas by a factor of \(2\), and this scaling factor is exactly the determinant of the matrix.
Sometimes, the determinant of a matrix can be negative. In this case, the absolute value of the determinant still indicates the scaling factor, while the negative sign means the orientation determined by the basis vectors has been reversed. To make this clearer, we can think of the original 2-D space as a sheet which has been flipped over after the linear transformation is done.
Mathematically, the determinant of a \(2 \times 2\) matrix is computed as \[ det\left( \begin{bmatrix} a & b \\ c & d \end{bmatrix} \right) = ad - bc \] and \(ad - bc\) is exactly the (signed) area of the parallelogram whose adjacent edges are \(\begin{bmatrix} a \\ c \end{bmatrix}\) and \(\begin{bmatrix} b \\ d \end{bmatrix}\). If we extend the determinant computation to 3-D, we will see the determinant is exactly the (signed) volume of the parallelepiped spanned by the matrix's column vectors.
Also, determinants satisfy the following rule \[ det(M_1 M_2) = det(M_1) det(M_2) \] because the transformation represented by \(M_1 M_2\) is equivalent to carrying out \(M_2\) and then \(M_1\) successively, and therefore the overall scaling factor is the product of the separate scaling factors.
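A small NumPy check (my addition) of the scaling-factor interpretation and of the product rule:

```python
import numpy as np

M1 = np.array([[1.0, 0.0],
               [0.0, 2.0]])    # stretches the y direction by 2, so areas double
M2 = np.array([[0.0, -1.0],
               [1.0,  0.0]])   # rotation: preserves areas, det = 1

print(np.linalg.det(M1))       # 2.0
print(np.linalg.det(M2))       # 1.0
print(np.isclose(np.linalg.det(M1 @ M2),
                 np.linalg.det(M1) * np.linalg.det(M2)))   # True
```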
Inverse Matrices, Column Space and Null Space
From the perspective of linear transformations, these concepts look different from the way we usually learn them through computation. Let's first look at a system of equations \[ \begin{matrix} 2x + 5y + 3z = -3 \\ 4x + 0y + 8z = 0 \\ 1x + 3y + 0z = 2 \end{matrix} \] which should look familiar from school. If we present this linear system using matrix multiplication, we get \[ \begin{bmatrix} 2 & 5 & 3 \\ 4 & 0 & 8 \\ 1 & 3 & 0 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} -3 \\ 0 \\ 2 \end{bmatrix} \] which is very intuitive if we write the system as \(A \vec{x} = \vec{b}\) and recall the geometric meaning of matrix multiplication. \(A\) indicates a linear transformation, and solving \(A \vec{x} = \vec{b}\) means we are looking for a vector \(\vec{x}\) that lands on \(\vec{b}\) after being transformed. In short, some vector is stretched and rotated until it becomes \(\begin{bmatrix} -3 \\ 0 \\ 2 \end{bmatrix}\), and that vector is exactly what we are looking for. How do we find it? Let's first consider the situation where \(det(A) \neq 0\), meaning the transformation doesn't shrink the dimension of the space. Then we can use \(A\)'s inverse to get the solution \[ A^{-1} A \vec{x} = \vec{x} = A^{-1} \vec{b} \] Note that \(\vec{x}\) is transformed to \(\vec{b}\) under \(A\), and \(\vec{b}\) is transformed back to \(\vec{x}\) under \(A^{-1}\). This is very similar to functions and inverse functions \[ f(x) = y \\ f^{-1}(y) = x \\ f^{-1}(f(x)) = f^{-1}(y) = x \] where any value comes back to itself if it is mapped by a function and then by the corresponding inverse function. The same idea goes for linear transformations. \[ A^{-1} A \vec{x} = \vec{x} \] In general, \(A^{-1}\) is the unique transformation with the property that if we apply \(A\) and then \(A^{-1}\), we end up back where we started. \(A^{-1} A\) amounts to a transformation that does nothing, which is also called the identity transformation. Geometrically, \(A^{-1}\) transforms every vector back to what it was before \(A\) was applied. For example, if \(A\) is a counterclockwise rotation by \(90^{\circ}\), then \(A^{-1}\) is a clockwise rotation by \(90^{\circ}\).
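A NumPy sketch (my addition) solving the system above both via the inverse and via a direct solver:

```python
import numpy as np

A = np.array([[2.0, 5.0, 3.0],
              [4.0, 0.0, 8.0],
              [1.0, 3.0, 0.0]])
b = np.array([-3.0, 0.0, 2.0])

print(np.linalg.det(A))                        # nonzero, so A does not collapse the space
x = np.linalg.inv(A) @ b                       # x = A^{-1} b
print(np.allclose(A @ x, b))                   # True: x really lands on b
print(np.allclose(x, np.linalg.solve(A, b)))   # True: solve() avoids forming the inverse
```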
However, if \(det(A) = 0\), it means \(A\) squishes a higher-dimensional space into a lower-dimensional one, like squishing a plane into a line. In this case there is no inverse matrix \(A^{-1}\), because we cannot "unsquish" a line into a plane. At least, that's not something a function can do, since it would require sending an individual vector to a whole set of vectors, while a function always sends each input to a single output.
We use a new term, rank, to describe these situations. For a \(3 \times 3\) matrix \(A\), if the output of the transformation is a line, meaning it is one-dimensional, we say the transformation \(A\) has rank \(1\). Similarly, if the output is a plane, meaning it is two-dimensional, we say the transformation \(A\) has rank \(2\). So the rank is the number of dimensions in the output of a transformation.
To sum up, the set of all possible outputs \(A \vec{x}\) is called the column space of the matrix \(A\). This is natural because the column vectors of \(A\) tell us where the basis vectors land after the transformation, so the column space is the span of the columns. If the rank of a matrix is as high as it can be, meaning it equals the number of columns, we call the matrix full rank. If all basis vectors land on a line for a \(3 \times 3\) matrix, then the column space is a line and \(rank(A) = 1\). Solving the equation now becomes the question of whether the target vector \(\vec{b}\) lies within the span of the columns of \(A\). Continuing the example above: if \(\vec{b}\) happens to lie on the line where all the basis vectors land after the transformation, there are infinitely many solutions to the equation. However, if \(\vec{b}\) lies outside that span, there is no solution.
Note that there is one special vector which is always in the column space whatever the transformation: the zero vector. The set of all vectors that land on the origin after the transformation is called the null space or kernel of the matrix. It is the space of all vectors that become null. When we solve an equation like \(A \vec{x} = \vec{0}\), the null space gives us all possible solutions. Equations of this form are called homogeneous linear equations.
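A NumPy sketch (my addition) computing the rank and a basis of the null space via the SVD:

```python
import numpy as np

A = np.array([[2.0, 4.0],
              [1.0, 2.0]])       # collinear columns: the plane is squished onto a line

rank = np.linalg.matrix_rank(A)  # 1: the column space is a line
_, _, Vh = np.linalg.svd(A)
null_basis = Vh[rank:]           # rows of Vh beyond the rank span the null space

print(rank)                                # 1
print(null_basis)                          # one basis vector, proportional to [2, -1]
print(np.allclose(A @ null_basis.T, 0))    # True: these vectors land on the origin
```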
Non-square Matrices as Transformations between Dimensions
For a non-square matrix, we also use the linear transformation perspective to interpret its geometric meaning. For example, \[ \begin{bmatrix} 3 & 1 \\ 4 & 1 \\ 5 & 9 \end{bmatrix} \] is a \(3 \times 2\) matrix whose column vectors still indicate where the basis vectors of the original 2-D space land after the transformation. The 2 columns indicate that the input space has 2 basis vectors, and the 3 rows indicate that the landing spot of each of these basis vectors is described with three separate coordinates. So this matrix represents a transformation that maps 2 dimensions into 3 dimensions. The matrix is still full rank, because the rank equals the number of columns. Note that the rank is still 2 even though the matrix maps from 2-D into 3-D: the set of all possible outputs spans a plane sitting inside 3-D space, not the whole 3-D space.
Similarly, the \(2 \times 3\) matrix below \[ \begin{bmatrix} 3 & 1 & 5\\ 4 & 1 & 5\\\end{bmatrix} \] represents a mapping from 3-D to 2-D, because the 3 columns indicate the input space has 3 basis vectors while the 2 rows indicate the landing spot of each of these basis vectors is described with only 2 coordinates. We can think of this process as squishing and projecting the 3 basis vectors onto a 2-D plane.
Now, let's look at a \(1 \times 2\) matrix \[ \begin{bmatrix} 3 & 2 \end{bmatrix} \] which represents the process of squishing a plane onto a number line while keeping evenly spaced dots evenly spaced after the mapping.
To sum up, for non-square matrices, the number of columns and the number of rows give the dimension of the input space and the dimension of the output space respectively. And there is no determinant for non-square matrices: the determinant measures how a transformation scales sizes within one space of a fixed dimension, and there is no way to measure how sizes change across dimensions.
Dot Products and Duality
A fuller understanding of the role the dot products play in math can only be found in the light of linear transformations.
Let's first review the standard introduction of the dot product and its geometric meaning. \[ \vec{v} \cdot \vec{w} = \begin{bmatrix} a \\ b \end{bmatrix} \cdot \begin{bmatrix} c \\ d \end{bmatrix} = ac + bd\\ \vec{v} \cdot \vec{w} = (Length \space of \space projected \space \vec{w}) (Length \space of \space \vec{v}) \\ \vec{v} \cdot \vec{w} = (Length \space of \space projected \space \vec{v}) (Length \space of \space \vec{w}) \] One surprisingly nice property of the dot product is that the order of this projection-and-multiplication process doesn't matter. We can project \(\vec{v}\) onto \(\vec{w}\) and multiply the projected length of \(\vec{v}\) by the length of \(\vec{w}\), or project \(\vec{w}\) onto \(\vec{v}\) and multiply the projected length of \(\vec{w}\) by the length of \(\vec{v}\). This can be proved by building similar triangles, or as follows \[ \vec{v} \cdot \vec{w} = |\vec{w}| \cos\langle\vec{v}, \vec{w}\rangle \, |\vec{v}| = |\vec{v}| \cos\langle\vec{v}, \vec{w}\rangle \, |\vec{w}| \] The trickier question is how this projection-and-multiplication perspective is connected with the perspective of multiplying pairs of coordinates and adding them up.
To answer this, let's recall the geometric meaning of a \(1 \times 2\) matrix covered in the last chapter, say the matrix below \[ \begin{bmatrix} 3 & -2 \end{bmatrix} \] which represents a transformation where the 2 basis vectors of 2-D space land on 3 and -2 on a 1-D number line. If we apply this transformation to some 2-D vector, we get \[ \begin{bmatrix} 3 & -2 \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = 3a -2b \] where \(3a-2b\) is exactly where the original vector \(\begin{bmatrix} a \\ b \end{bmatrix}\) lands on the number line after the transformation. This matrix multiplication is numerically identical to the dot product between \(\begin{bmatrix} a \\ b \end{bmatrix}\) and \(\begin{bmatrix} 3 \\ -2 \end{bmatrix}\). So it's natural to suspect there is a nice association between a \(1 \times 2\) matrix and a 2-D vector. Let's look at the image below
In the 2-D coordinate system, we have a unit vector \(\vec{u}\) and the basis vectors sitting on the \(x\) and \(y\) axes. We also draw a number line through \(\vec{u}\) to show where 2-D vectors will land. From the image, using a line of symmetry, we can tell that the basis vector \(\hat{i}\) sitting on the \(x\) axis projects to a number exactly equal to the \(x\) coordinate of \(\vec{u}\), and the same goes for the other basis vector \(\hat{j}\). So we have found a 2-D-to-1-D linear projection transformation determined by \(\vec{u}\), and the entries of the corresponding \(1 \times 2\) matrix describing that transformation are exactly the coordinates of \(\vec{u}\). This explains why taking a dot product with a unit vector can be interpreted as projecting the other vector onto its span and taking the length. \[ \begin{bmatrix} u_x & u_y \end{bmatrix} \begin{bmatrix} a \\ b \end{bmatrix} = a \cdot u_x + b \cdot u_y \\ \begin{bmatrix} u_x \\ u_y \end{bmatrix} \cdot \begin{bmatrix} a \\ b \end{bmatrix} = a \cdot u_x + b \cdot u_y \\ \] Let us think about the process again. We had a linear transformation from 2-D space to the number line which was not defined in terms of numerical dot products; it was defined by projecting space onto a copy of the number line determined by the vector \(\vec{u}\). Because the transformation is linear, it is necessarily described by some \(1 \times 2\) matrix whose entries are the coordinates of \(\vec{u}\). And multiplying this matrix by another 2-D vector \(\vec{v}\) is the same as taking the dot product between \(\vec{u}\) and \(\vec{v}\), so the transformation and the vector are inescapably related to each other. The punch line is that for any linear transformation whose output space is the number line, there is a unique vector corresponding to that transformation, and applying the transformation is the same thing as taking a dot product with that vector. This is an example of duality: the dual of a linear transformation from \(n\) dimensions to \(1\) dimension is a vector in those \(n\) dimensions.
A takeaway here is that a vector sometimes can be interpreted as a linear transformation instead of an arrow in space.
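A quick NumPy check (my addition) that applying the \(1 \times 2\) matrix is the same as taking a dot product with its dual vector:

```python
import numpy as np

row = np.array([[3.0, -2.0]])    # 1x2 matrix: a 2D-to-1D linear transformation
u   = np.array([3.0, -2.0])      # its dual vector
v   = np.array([4.0, 5.0])

as_transformation = (row @ v)[0]   # apply the matrix to v
as_dot_product    = np.dot(u, v)   # dot v with the dual vector

print(as_transformation, as_dot_product)   # 2.0 2.0
```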
Cross Product
The cross product of two 3-D vectors is a new vector. The length of that vector is the area of the parallelogram spanned by the 2 vectors, and its direction is perpendicular to that parallelogram, as given by the right-hand rule. (In 2-D, the "cross product" of 2 vectors is usually taken to be just the signed area of that parallelogram, i.e. a number.) Specifically, the cross product is defined as \[ \begin{bmatrix} v_1 \\ v_2 \\ v_3 \end{bmatrix} \times \begin{bmatrix} w_1 \\ w_2 \\ w_3 \end{bmatrix} = det \left( \begin{bmatrix} \hat{i} & v_1 & w_1 \\ \hat{j} & v_2 & w_2 \\ \hat{k} & v_3 & w_3 \end{bmatrix} \right) \] Recall that for a \(1 \times 2\) matrix, there is always a 2-D vector (the dual vector of that transformation) that corresponds to it, and performing the transformation is the same as taking a dot product with that vector. This is called duality, and it does not only apply to \(1 \times 2\) matrices; it applies to any matrix whose corresponding linear transformation outputs a \(1\)-dimensional space. The cross product also embodies this idea of duality.
To explain how duality applies to the cross product, the plan is to
- define a 3D-to-1D linear transformation in terms of \(\vec{v}\) and \(\vec{w}\),
- find its dual vector \(\vec{p}\) in 3-D space,
- show that this dual vector satisfies \(\vec{p} = \vec{v} \times \vec{w}\).
This works because the transformation makes the connection between the computation and the geometry of the cross product visible.
Recall that in 2D space, the cross product of \(\vec{v}\) and \(\vec{w}\) is simply the determinant of the matrix whose column vectors are \(\vec{v}\) and \(\vec{w}\), which is also the area of the parallelogram spanned by these 2 vectors. It is natural to expect the volume of some parallelepiped to show up in the 3D cross product. But what does that parallelepiped look like? Consider the function \[ L\left( \begin{bmatrix} x \\ y \\ z \end{bmatrix} \right) = det \left( \begin{bmatrix} x & v_1 & w_1 \\ y & v_2 & w_2 \\ z & v_3 & w_3 \end{bmatrix} \right) \] which gives the signed volume of the parallelepiped spanned by \(\vec{v}\), \(\vec{w}\) and an unknown 3D vector. An important feature of this function is that it is linear. Because of that, we can bring in the idea of duality: there is a \(1 \times 3\) matrix describing this 3D-to-1D transformation, \[ \begin{bmatrix} v_2 w_3 - v_3 w_2 & v_3 w_1 - v_1 w_3 & v_1 w_2 - v_2 w_1 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \end{bmatrix} = det \left( \begin{bmatrix} x & v_1 & w_1 \\ y & v_2 & w_2 \\ z & v_3 & w_3 \end{bmatrix} \right) \] and a corresponding dual vector such that taking a dot product with it is equivalent to performing the transformation. \[ \begin{bmatrix} v_2 w_3 - v_3 w_2 \\ v_3 w_1 - v_1 w_3 \\ v_1 w_2 - v_2 w_1 \end{bmatrix} \cdot \begin{bmatrix} x \\ y \\ z \end{bmatrix} = det \left( \begin{bmatrix} x & v_1 & w_1 \\ y & v_2 & w_2 \\ z & v_3 & w_3 \end{bmatrix} \right) \] So the function above is built to find a vector \(\vec{p}\) such that taking the dot product between \(\vec{p}\) and \(\begin{bmatrix} x \\ y \\ z \end{bmatrix}\) is equivalent to the determinant of the matrix whose column vectors are \(\begin{bmatrix} x \\ y \\ z \end{bmatrix}\), \(\vec{v}\) and \(\vec{w}\).
This also gives the geometric meaning of \(\vec{p}\): it is the 3D vector such that taking the dot product between \(\vec{p}\) and \(\begin{bmatrix} x \\ y \\ z \end{bmatrix}\) is equivalent to the signed volume of the parallelepiped spanned by \(\begin{bmatrix} x \\ y \\ z \end{bmatrix}\), \(\vec{v}\) and \(\vec{w}\). To state the geometric property more clearly, let's decompose the volume of that parallelepiped as
\[ (Area \space of \space parallelogram \space spanned \space by \space \vec{v}, \vec{w}) \times (Component \space of \begin{bmatrix} x \\ y \\ z \end{bmatrix} perpendicular \space to \space \vec{v}, \vec{w}) \] From this perspective, the function above projects the vector \(\begin{bmatrix} x \\ y \\ z \end{bmatrix}\) onto a line perpendicular to \(\vec{v}\) and \(\vec{w}\), then multiplies the length of the projection by the area of the parallelogram spanned by \(\vec{v}\) and \(\vec{w}\). This is the same as taking a dot product between \(\begin{bmatrix} x \\ y \\ z \end{bmatrix}\) and a vector perpendicular to \(\vec{v}\), \(\vec{w}\) whose length equals the area of that parallelogram. That vector is the geometric description of \(\vec{p}\).
To integrate the geometric and computational perspectives: \(\vec{p}\) (from the geometry) and \(\vec{v} \times \vec{w}\) (from the computation) are dual vectors of the same linear transformation, so they must be the same vector. This is how the formula for the cross product of two 3D vectors connects with its geometric meaning.
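A NumPy check (my addition) that the dual vector defined by the determinant is exactly the cross product:

```python
import numpy as np

v = np.array([1.0, 2.0, 3.0])
w = np.array([4.0, 5.0, 6.0])
x = np.array([-2.0, 1.0, 7.0])        # an arbitrary third vector

p = np.cross(v, w)                    # the claimed dual vector
vol = np.linalg.det(np.column_stack([x, v, w]))   # signed volume of the parallelepiped

print(p)                              # [-3.  6. -3.]
print(np.isclose(np.dot(p, x), vol))  # True: dotting with p equals the determinant
```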
Change of Basis
So far, we have always used a coordinate system to translate between vectors and sets of numbers. There are 2 special vectors \(\hat{i}\) and \(\hat{j}\) called the basis vectors of the standard coordinate system, and every vector in the coordinate system is a linear combination of these basis vectors. Now let's think about what happens if we change the set of basis vectors into a different set.
Let's consider another set of basis vectors \(\hat{b_1} = \begin{bmatrix} 2 \\ 1 \end{bmatrix}\) and \(\hat{b_2} = \begin{bmatrix} -1 \\ 1 \end{bmatrix}\); they play the same role as \(\hat{i} = \begin{bmatrix} 1 \\ 0 \end{bmatrix}\) and \(\hat{j} = \begin{bmatrix} 0 \\ 1 \end{bmatrix}\) do in our system. It is natural to ask how to translate vectors between the different coordinate systems. For example, if a vector is expressed as \(\begin{bmatrix} 2 \\ 1 \end{bmatrix}\) in the \(\hat{b_1}\), \(\hat{b_2}\) system, what does it look like in our system? Likewise, if a vector is expressed as \(\begin{bmatrix} 2 \\ 1 \end{bmatrix}\) in our system, what does it look like in the \(\hat{b_1}\), \(\hat{b_2}\) system?
Now let's look at the matrix whose columns are \(\hat{b_1}\), \(\hat{b_2}\) as below \[ \begin{bmatrix} 2 & -1 \\ 1 & 1\end{bmatrix} \] Geometrically, this matrix moves our basis vectors onto the new basis vectors, transforming our grid onto the new one. But numerically, since the new basis vectors are written in our original language, multiplying by this matrix translates coordinates from the \(\hat{b_1}\), \(\hat{b_2}\) language into ours. Therefore, if a vector is expressed as \(\begin{bmatrix} x \\ y \end{bmatrix}\) in the \(\hat{b_1}\), \(\hat{b_2}\) system, what it looks like in our system is \[ \begin{bmatrix} 2 & -1 \\ 1 & 1\end{bmatrix} \begin{bmatrix} x \\ y\end{bmatrix} \] and this is because a vector is always the same linear combination of its own basis vectors, here \(x \hat{b_1} + y \hat{b_2}\).
Conversely, if a vector is expressed as \(\begin{bmatrix} x \\ y \end{bmatrix}\) in our original system, its coordinates in the \(\hat{b_1}\), \(\hat{b_2}\) system are \[ \begin{bmatrix} 2 & -1 \\ 1 & 1\end{bmatrix}^{-1} \begin{bmatrix} x \\ y\end{bmatrix} \] for the same reason, run in reverse: the inverse matrix translates our language back into the \(\hat{b_1}\), \(\hat{b_2}\) language.
Since vectors are not the only things expressed with coordinates, the question now becomes how we translate matrices/linear transformations between different coordinate systems. For example, what does a \(90^{\circ}\) clockwise rotation look like in the \(\hat{b_1}\), \(\hat{b_2}\) system? \[ \begin{bmatrix} 2 & -1 \\ 1 & 1\end{bmatrix}^{-1} \begin{bmatrix} 0 & 1 \\ -1 & 0\end{bmatrix} \begin{bmatrix} 2 & -1 \\ 1 & 1\end{bmatrix} \] Reading right to left: the rightmost matrix translates a vector from the \(\hat{b_1}\), \(\hat{b_2}\) language into our language, the middle matrix performs the rotation in our language, and the leftmost (inverse) matrix translates the result back into the \(\hat{b_1}\), \(\hat{b_2}\) language. The composition of these 3 matrices gives us the \(90^{\circ}\) clockwise rotation expressed in the language of the \(\hat{b_1}\), \(\hat{b_2}\) system.
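A NumPy sketch (my addition) of this recipe: translate into our language, rotate, then translate back.

```python
import numpy as np

B = np.array([[2.0, -1.0],
              [1.0,  1.0]])     # columns are b1, b2 written in our language
M = np.array([[ 0.0, 1.0],
              [-1.0, 0.0]])     # 90 degree clockwise rotation, in our language

M_in_b = np.linalg.inv(B) @ M @ B   # the same rotation, in the b1, b2 language

v_b = np.array([1.0, 2.0])          # some vector, written in b1, b2 coordinates
rotated_ours = M @ (B @ v_b)        # translate to our language, then rotate there
rotated_b    = M_in_b @ v_b         # rotate directly in the b1, b2 language

print(np.allclose(B @ rotated_b, rotated_ours))   # True: both describe the same arrow
```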
Whenever we see an expression like \(A^{-1} M A\), it suggests a translation, a kind of mathematical empathy: \(M\) represents the transformation as we see it, and the other 2 matrices represent the shift in perspective, the translator. The product still indicates the same transformation, just seen from another coordinate system's point of view.
Eigenvectors and Eigenvalues
Let's think about a matrix like \(\begin{bmatrix} 3 & 1 \\ 0 & 2\end{bmatrix}\) and a random vector \(\begin{bmatrix} x \\ y \end{bmatrix}\). If we apply the linear transformation represented by that matrix to that vector, the vector most likely gets knocked off its own span (the line passing through the origin and its tip) during the transformation. But there are some special vectors that do remain on their own span, meaning the only effect the matrix has on such a vector is stretching or squishing it, like a scalar would. For this specific example, \(\begin{bmatrix} 1 \\ 0 \end{bmatrix}\) is such a special vector: it is stretched to 3 times itself and still lands on the \(x\) axis. And due to linearity, any other vector on the \(x\) axis (that vector's span) is also stretched by a factor of 3 during the transformation. Such vectors are called eigenvectors of the transformation, and the corresponding eigenvalues measure the factor by which they are stretched or squished during the transformation. Translating this into mathematical terms, we get \[ A \vec{v} = \lambda \vec{v} \] which is equivalent to solving \((A - \lambda I)\vec{v} = \vec{0}\). If we want a non-zero solution \(\vec{v}\), the transformation \(A - \lambda I\) must be one that reduces the dimension of space, meaning it has a zero determinant. We can think of subtracting \(\lambda I\) as perturbing the transformation \(A\) until the new transformation squishes space into a lower dimension, or equivalently until its column vectors become collinear.
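A NumPy check (my addition) on the example matrix: the eigenvalues are 3 and 2, and \(\begin{bmatrix} 1 \\ 0 \end{bmatrix}\) is indeed an eigenvector with eigenvalue 3.

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [0.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)   # columns of `eigenvectors` are the eigenvectors
print(eigenvalues)                             # [3. 2.]

for lam, v in zip(eigenvalues, eigenvectors.T):
    print(np.allclose(A @ v, lam * v))         # True, True: A only scales these vectors
```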
And why are these useful things to think about? Let's consider some 3D rotation. If we can find an eigenvector for that rotation, then we find the axis of that rotation. It's much easier to think about a 3D rotation in terms of some axis of rotation with some angle compared to a \(3 \times 3\) matrix.
There are some takeaways about solving eigenvalues and eigenvectors
- It doesn't matter if there is a negative eigenvalue as long as the eigenvector stays on the line it spans out without getting knocked off.
- A transformation doesn't have to have eigenvectors. For example, a \(90^{\circ}\) rotation has no (real) eigenvectors, since every non-zero vector is rotated off its own span.
- There may be only 1 eigenvalue but many eigenvectors for a transformation. Take a transformation that stretches everything by a factor of 2: its only eigenvalue is 2, but every non-zero vector in the plane is an eigenvector with that eigenvalue.
What if all basis vectors are eigenvectors? If we write the above eigenvector equation in matrix form, it is \(A V = V \Lambda\), where the columns of \(V\) are eigenvectors of \(A\) and \(\Lambda\) is a diagonal matrix whose diagonal elements are the corresponding eigenvalues. Then the linear transformation \(A\) is expressed as \(V^{-1} A V = \Lambda\) in the language of the eigen-basis system, and consequently \(A^n = V \Lambda^{n} V^{-1}\). Because \(\Lambda\) is a diagonal matrix, it's much easier to compute \(\Lambda^n\) than to compute the \(n\)-th power of a non-diagonal matrix.
What if the basis vectors are not eigenvectors? Because of the nice properties above, we would like to perform a change of basis so that the eigenvectors become our basis vectors, which is only possible when there are enough eigenvectors to span the full space. Mathematically, changing to the eigen-basis gives \[ V^{-1} A V = \Lambda \] and this is called diagonalization.
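A NumPy sketch (my addition) using diagonalization to compute a matrix power:

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [0.0, 2.0]])
lam, V = np.linalg.eig(A)       # A V = V diag(lam)
Lam = np.diag(lam)

print(np.allclose(np.linalg.inv(V) @ A @ V, Lam))   # True: A is diagonal in the eigen-basis

n = 5
A_pow = V @ np.diag(lam ** n) @ np.linalg.inv(V)    # A^n = V Lam^n V^{-1}
print(np.allclose(A_pow, np.linalg.matrix_power(A, n)))   # True
```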
Abstract Vector Spaces
Let's go back to the original question: what exactly are vectors? Are they arrows in space or ordered lists of numbers? We shouldn't define vectors merely as lists of numbers, since things like determinants and eigenvectors don't care about the choice of coordinate system. In summary, arrows and ordered lists of numbers are ways to specify vectors, but they are technically not the full definition. In mathematical terms, vectors are elements of vector spaces, and there are 8 axioms any vector space must satisfy so that the vector operations, dot products and eigen-things above remain valid.
Other Topics
Orthogonal Matrix
An orthogonal matrix is a square matrix whose columns and rows are orthogonal unit vectors (orthonormal vectors). Mathematically speaking, for the \(i_{th}\) and \(j_{th}\) column \(C_i\), \(C_j\) of the orthogonal matrix \(A\), we have \[ \langle C_i, C_j \rangle = \delta_{ij} \] where \(\delta_{ij} = 1\) for \(i=j\) while \(\delta_{ij} = 0\) for \(i \neq j\). Then we can derive \[ A^T A = \begin{bmatrix} C_1^T \\ ... \\C_n^T \end{bmatrix} \begin{bmatrix} C_1 & ... & C_n \end{bmatrix} = [\langle C_i, C_j \rangle]_{1\leq i,j \leq n}=I_n \]
which is also \(A^T = A^{-1}\). Besides, \(A A^T = I_n\) holds true and \(A^T\) is also an orthogonal matrix.
Note that if a matrix has pairwise orthogonal column vectors but not pairwise orthogonal row vectors, then it is not an orthogonal matrix; orthogonal columns alone (without unit length) are not enough. For example, \(\begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\1 & 0 & 0\end{bmatrix}\) is such a matrix.
Geometrically, an orthogonal matrix represents an orthogonal transformation that preserves a symmetric inner product. To put it simply, an orthogonal transformation is either a rigid rotation or a rotation followed by a flip. Neither stretches nor squishes the original space; both preserve the lengths of vectors and the angles between vectors after the transformation. Mathematically, these properties of orthogonal matrices can be expressed as
- \(\langle \vec{v}, \vec{w} \rangle = \langle A\vec{v}, A\vec{w} \rangle\)
- \(||\vec{v}|| = ||A \vec{v}||\)
- \(det(A) = 1\) or \(-1\)
Besides, the product of orthogonal matrices is also orthogonal. Orthogonal matrices are the real analogue of unitary matrices (their complex counterparts). A number of decompositions involve unitary/orthogonal matrices, including the \(QR\) decomposition, the singular value decomposition and the eigenvalue decomposition of a symmetric matrix.
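A NumPy check (my addition) of these properties for a rotation matrix:

```python
import numpy as np

theta = 0.7
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])   # a rigid rotation

v = np.array([3.0, -1.0])
w = np.array([0.5,  2.0])

print(np.allclose(Q.T @ Q, np.eye(2)))                       # True: Q^T = Q^{-1}
print(np.isclose(np.dot(Q @ v, Q @ w), np.dot(v, w)))        # True: inner products preserved
print(np.isclose(np.linalg.norm(Q @ v), np.linalg.norm(v)))  # True: lengths preserved
print(np.isclose(np.linalg.det(Q), 1.0))                     # True: det is +1 (or -1 with a flip)
```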
Transpose of Matrices
The transpose of a matrix is obtained by flipping the matrix over its diagonal, switching its rows and columns so that \(A_{ij} = A^T_{ji}\). But from the perspective of linear transformations, how do we interpret the transpose of a matrix geometrically? There is a discussion giving 3 perspectives on the transpose; the one I find most intuitive is based on the Singular Value Decomposition (SVD). Let's first see how to interpret the SVD geometrically, as below

which is the geometric representation of the SVD \(A = U \Sigma V^T\), where \(V\) and \(U\) are orthogonal matrices and \(\Sigma\) is a diagonal matrix. The linear transformation represented by \(A\) can therefore be seen as the successive actions of first rotating (\(V^T\)), then scaling (\(\Sigma\)) and finally rotating again (\(U\)). This is what we see when we read the image above from left to right.
The transpose of the matrix, \(A^T = V \Sigma U^T\), can then be derived. Since \(U\) and \(V\) are orthogonal, \(U^T = U^{-1}\) and \(V = {(V^T)}^{-1}\) hold, so we can rewrite the transpose as \(A^T = {(V^T)}^{-1} \Sigma U^{-1}\). Recall that an inverse can be seen as the transformation that undoes the original one: \(V\) undoes the rotation \(V^T\), and \(U^T\) undoes the rotation \(U\). Therefore \(A^T\) can also be interpreted as the successive actions of first rotating, then scaling and finally rotating, except that its rotations exactly undo the rotations of \(A\).
With this interpretation, we can easily prove \(det(A) = det(A^T)\) and here is another brief proof.
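A NumPy check (my addition) of the SVD picture of the transpose and of \(det(A) = det(A^T)\):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))

U, s, Vh = np.linalg.svd(A)                        # A = U @ diag(s) @ Vh
print(np.allclose(A,   U @ np.diag(s) @ Vh))       # True
print(np.allclose(A.T, Vh.T @ np.diag(s) @ U.T))   # True: same scaling, opposite rotations
print(np.isclose(np.linalg.det(A), np.linalg.det(A.T)))   # True
```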
Less Intuitive Concepts and Conclusions
- Interpret the trace of a matrix geometrically; \(Trace(AB) = Trace(BA)\).
- \(dim(column \space space \space of \space A) = dim(row \space space \space of \space A)\), i.e. the column rank equals the row rank.
Kernel, Image and Rank Nullity Theorem
We have covered the concept of the kernel in the Null Space chapter. If there is an \(m \times n\) matrix \(A\) representing a linear transformation \(T: V \rightarrow W\), then the kernel of \(T\) is defined as \[ Kernel(T) = \{ \vec{v} \in V | T(\vec{v}) = \vec{0}\} \] where \(\vec{0}\) is the zero vector in \(W\).
While the image of \(T\) or the range of \(T\) is defined as \[ Image(T) = \{T(\vec{v}) | \vec{v} \in V\} \] Note, the image of \(T\) is a subspace of the output space \(W\) while the kernel of \(T\) is a subspace of the input space \(V\). Here is a video illustrating this.
And the Rank Nullity Theorem is stated as \[ Rank(A) + dim(Kernel(T)) = n \] Intuitively, \(n\) is the number of dimensions of the input space, the rank is the number of dimensions of the image (the part of the output space that is actually reached), and \(dim(Kernel(T))\) is the number of dimensions lost in performing the transformation \(T\). For example, a \(1 \times 2\) matrix \(A\) squishes the plane onto a number line: the rank is 1 and exactly one dimension of information is lost in the 2D-to-1D transformation, so \(1 + 1 = 2\).
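A NumPy sketch (my addition) verifying rank + nullity = n for the \(1 \times 2\) example:

```python
import numpy as np

A = np.array([[3.0, 2.0]])            # 1x2 matrix: squishes the plane onto a number line
m, n = A.shape

rank = np.linalg.matrix_rank(A)       # dimension of the image
_, _, Vh = np.linalg.svd(A)
nullity = Vh.shape[0] - rank          # dimension of the kernel (null space)

print(rank, nullity, n)               # 1 1 2
print(rank + nullity == n)            # True
```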
Solutions to Linear Systems
For a linear system \(Ax = b\), we can use the concepts of rank and linear transformations to analyze whether the system has any solution. From the perspective of linear transformations, the \(m\times n\) matrix \(A\) represents a transformation, \(x\) is a vector in the input space and \(b\) is a vector in the output space. The equation asks whether the vector \(b\) lies within the column space of \(A\). Mathematically, that is the condition \[ rank([A \space b]) = rank(A) \] where \([A \space b]\) is the augmented matrix. If \(rank([A \space b]) > rank(A)\), then \(b\) is outside the column space and there is no solution. If \(rank([A \space b]) = rank(A)\), then \(b\) is inside the column space and there is either exactly one solution or infinitely many. Which of the two cases we are in depends on whether \(rank(A) = n\): if \(rank(A) = n\), there is exactly one solution; otherwise there are infinitely many.
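A NumPy sketch (my addition) of the rank test on a rank-deficient matrix:

```python
import numpy as np

A  = np.array([[2.0, 4.0],
               [1.0, 2.0]])          # rank 1: the column space is a line
b1 = np.array([2.0, 1.0])            # lies on that line
b2 = np.array([1.0, 0.0])            # lies off that line

def solvable(A, b):
    return np.linalg.matrix_rank(np.column_stack([A, b])) == np.linalg.matrix_rank(A)

print(solvable(A, b1))   # True  -> solutions exist (infinitely many, since rank < n)
print(solvable(A, b2))   # False -> no solution
```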
Characteristic Polynomial of Square Matrices
The characteristic polynomial of a square matrix is a polynomial which is invariant under matrix similarity and has the eigenvalues as its roots. \[ f_A(\lambda) = |\lambda I - A| \] For \(B = Q^{-1}A Q\), the characteristic polynomial is \[ f_B(\lambda) = |\lambda I - Q^{-1}A Q| = |Q^{-1}(\lambda I - A)Q| = |Q^{-1}| |\lambda I - A| |Q| = |\lambda I - A| \] since \(|Q^{-1}||Q| = 1\), which is the same as \(A\)'s.
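A NumPy check (my addition) that similar matrices share the same characteristic polynomial; np.poly of a square matrix returns the coefficients of its characteristic polynomial.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3))
Q = rng.standard_normal((3, 3))      # almost surely invertible
B = np.linalg.inv(Q) @ A @ Q         # B = Q^{-1} A Q is similar to A

print(np.allclose(np.poly(A), np.poly(B)))   # True: the polynomial is invariant under similarity
```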