The most common way to represent data is through matrices. As a result, the methods and algorithms designed to operate on matrices are very convenient for data science. We will dive into the mysteries of one of them: a method that has proven successful in matrix manipulations such as low-rank approximation, but also in more practical applications like recommender systems. This method is known as Singular Value Decomposition (SVD). This blog post is the first part of a series on SVD.

Linear Algebra 101

To start, we will lay out the mathematical foundations needed to understand SVD. If you are already familiar with the concepts listed below, feel free to jump to the next part of the series.

Vector Space

This section assumes that you are already familiar with concepts such as vectors. If not, you can check this video.

A vector space can be seen as a set of vectors that respects a set of rules (called axioms) under vector addition \((+)\) and multiplication by a scalar \((\cdot)\). Therefore, to be a vector space, a set needs to satisfy the following conditions:

  • Be a set of vectors
  • All vectors in the set must respect the axioms of a vector space

As a practical example, we will examine the space \(\mathbb{R}^2\) to see if it satisfies all the conditions to be a vector space. The space \(\mathbb{R}^2\) is the set of all vectors that have 2 real components (first condition ✅). But is it a vector space? Well, we know it is a set of vectors, but the other condition is that it must also respect a set of rules under \((+)\) and \((\cdot)\). “Which rules?” you might think. Well… let’s take a look at them one by one.

1. \(\forall u, v \in V, u + v \in V\)

This simply means that if we take two vectors \(u, v\) from the set \(V\), then the vector resulting from the addition of \(u\) and \(v\) is also part of the same set. In other words, no matter which vectors we take from \(V\), adding them will never allow us to escape out of it. \(V\) is like a prison under addition or, in mathematical terms, closed under vector addition. Let us demonstrate this axiom with \(\mathbb{R}^2\).

Fig. 1. Illustration of a vector addition of u and v in V.

From Figure 1, we can see the result is indeed a vector. However, we need to show that it is also in \(V = \mathbb{R}^2\) (don’t forget our practical example 🙂). For the resulting vector to be in \(\mathbb{R}^2\), we need it to possess properties of vectors in that space \(\mathbb{R}^2\), that is:

  • The vector must have two components
  • Those components must be real numbers

If this is not intuitive from the graph in Figure 1, we can further explain it using algebra. Let’s denote our two vectors \(u\) and \(v\) as:

$$ \begin{align*} u = \begin{bmatrix} u_1 \\ u_2 \\ \end{bmatrix} & & v = \begin{bmatrix} v_1 \\ v_2 \\ \end{bmatrix} \end{align*} $$

Therefore we have,

$$ u + v = \begin{bmatrix} u_1 + v_1 \\ u_2 + v_2 \\ \end{bmatrix} $$

If \(u_1, u_2, v_1\) and \(v_2\) are real numbers, then there exist no values such that \(u_1 + v_1\) and \(u_2 + v_2\) are not real numbers (you can try to falsify this statement 🙂). Therefore if \(u, v \in \mathbb{R}^2\), then \(u + v \in \mathbb{R}^2\).
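
If you prefer to see this in code, here is a quick sanity check (a sketch using NumPy purely for illustration): we add two arbitrary vectors of \(\mathbb{R}^2\) and confirm the result still looks like an element of \(\mathbb{R}^2\).

```python
import numpy as np

# Two arbitrary vectors of R^2 (any real components would do)
u = np.array([1.5, -2.0])
v = np.array([0.25, 3.0])

s = u + v                 # element-wise addition, as in the formula above
print(s.shape == (2,))    # True: the result still has exactly two components
print(np.isrealobj(s))    # True: both components are still real numbers
```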

2. \(\forall u, v \in V, u + v = v + u\)

From Fig 1, we can see that the order of addition does not matter. We still reach the same point in the graph with either \(u + v\) or \(v + u\). Therefore this condition is satisfied for \(V = \mathbb{R}^2\). A shortcut to verify this axiom is through the addition of numbers. We all know that the order of addition of real numbers does not matter (2 + 5 = 5 + 2 = 7). This property is also preserved in real vectors, provided we carry out element-wise addition.
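
The same shortcut, checked numerically (a NumPy sketch):

```python
import numpy as np

u = np.array([2.0, 5.0])
v = np.array([5.0, 2.0])

# The order of addition does not matter (commutativity)
print(np.array_equal(u + v, v + u))  # True
```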

3. \(\forall u, v, w \in V, u + (v + w) = (u + v) + w\)

Let \(u, v, w\) be denoted as:

$$ \begin{align*} u = \begin{bmatrix} u_1 \\ u_2 \\ \end{bmatrix} & & v = \begin{bmatrix} v_1 \\ v_2 \\ \end{bmatrix} & & w = \begin{bmatrix} w_1 \\ w_2 \\ \end{bmatrix} \end{align*} $$

Therefore we have,

$$ u + (v + w) = \begin{bmatrix} u_1 \\ u_2 \\ \end{bmatrix} + \begin{bmatrix} v_1 + w_1 \\ v_2 + w_2 \\ \end{bmatrix} = \begin{bmatrix} u_1 + v_1 + w_1 \\ u_2 + v_2 + w_2 \\ \end{bmatrix} = \begin{bmatrix} u_1 + v_1 \\ u_2 + v_2 \\ \end{bmatrix} + \begin{bmatrix} w_1 \\ w_2 \\ \end{bmatrix} = (u + v) + w $$
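
Numerically (a NumPy sketch with arbitrary values):

```python
import numpy as np

u = np.array([1.0, 2.0])
v = np.array([-3.0, 0.5])
w = np.array([4.0, -1.0])

# The grouping of the additions does not matter (associativity)
print(np.allclose(u + (v + w), (u + v) + w))  # True
```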

4. \(\exists~\boldsymbol{0} \in V, \forall u \in V, u + \boldsymbol{0} = u\)

\(\boldsymbol{0}\) is called the neutral element of the addition operation. In \(\mathbb{R}^2\), there is only one vector that can be added to any other such that the output is the original vector, and this vector is:

$$ \boldsymbol{0} = \begin{bmatrix} 0 \\ 0 \\ \end{bmatrix} $$

and if we add \(\boldsymbol{0}\) to \(u\), we get:

$$ u + \boldsymbol{0} = \begin{bmatrix} u_1 \\ u_2 \\ \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \\ \end{bmatrix} = \begin{bmatrix} u_1 + 0 \\ u_2 + 0 \\ \end{bmatrix} = \begin{bmatrix} u_1 \\ u_2 \\ \end{bmatrix} = u $$

As we can see, it does not really matter what \(u_1\) and \(u_2\) are; we still get \(u\) back.
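
The same check in NumPy (a sketch):

```python
import numpy as np

u = np.array([7.0, -3.5])
zero = np.zeros(2)        # the neutral element [0, 0]

# Adding the zero vector gives u back, whatever u is
print(np.array_equal(u + zero, u))  # True
```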

5. \(\forall u \in V, \exists~(-u) \in V, u + (-u) = \boldsymbol{0}\)

$$ u = \begin{bmatrix} u_1 \\ u_2 \\ \end{bmatrix} $$

then,

$$ -u = \begin{bmatrix} -u_1 \\ -u_2 \\ \end{bmatrix} $$

which implies that,

$$ u + (-u) = \begin{bmatrix} u_1 \\ u_2 \\ \end{bmatrix} + \begin{bmatrix} -u_1 \\ -u_2 \\ \end{bmatrix} = \begin{bmatrix} u_1 + (-u_1) \\ u_2 + (-u_2) \\ \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ \end{bmatrix} = \boldsymbol{0} $$
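
And in NumPy (a sketch):

```python
import numpy as np

u = np.array([4.0, -9.0])

# Adding a vector to its negation lands us back on the zero vector
print(np.array_equal(u + (-u), np.zeros(2)))  # True
```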

Now, let’s verify the axioms for scalar multiplication.

6. \(\forall u \in V, c \in \mathbb{R}, cu \in V\)

Fig. 2. Illustration of a multiplication of u by a scalar 2 in V.

As you can see from Fig 2, multiplying the vector \(u\) by 2 produces a vector twice the size of \(u\). However, the resulting vector is still in \(\mathbb{R}^2\). We can also see this through algebra.

$$ cu = c\begin{bmatrix} u_1 \\ u_2 \\ \end{bmatrix} = \begin{bmatrix} cu_1 \\ cu_2 \\ \end{bmatrix} $$

Each component of \(u\) is now \(c\) times larger, and as long as \(c \in \mathbb{R}\), we have \(cu_1, cu_2 \in \mathbb{R}\), which therefore means that \(cu \in \mathbb{R}^2\).
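
A quick NumPy sketch of the same idea:

```python
import numpy as np

u = np.array([1.0, -2.5])
c = 2.0                   # the scalar used in Fig. 2

cu = c * u                # element-wise scaling
print(cu.shape == (2,))   # True: cu still has exactly two components
print(np.isrealobj(cu))   # True: both components are still real, so cu is in R^2
```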

7. \(\forall u, v \in V, c \in \mathbb{R}, c(u + v) = cu + cv\)

$$ \begin{align*} c(u + v) &= c\left(\begin{bmatrix} u_1 \\ u_2 \\ \end{bmatrix} + \begin{bmatrix} v_1 \\ v_2 \\ \end{bmatrix}\right) \\[10pt] &= c\left(\begin{bmatrix} u_1 + v_1 \\ u_2 + v_2 \\ \end{bmatrix}\right) \\[10pt] &= \begin{bmatrix} cu_1 + cv_1 \\ cu_2 + cv_2 \\ \end{bmatrix} = \begin{bmatrix} cu_1 \\ cu_2 \\ \end{bmatrix} + \begin{bmatrix} cv_1 \\ cv_2 \\ \end{bmatrix} = cu + cv \end{align*} $$

8. \(\forall u \in V, c, d \in \mathbb{R}, (c + d)u = cu + du\)

$$ \begin{align*} (c + d)u &= (c + d)\begin{bmatrix} u_1 \\ u_2 \\ \end{bmatrix} \\[10pt] &= \begin{bmatrix} (c + d)u_1 \\ (c + d)u_2 \\ \end{bmatrix} \\[10pt] &= \begin{bmatrix} cu_1 + du_1 \\ cu_2 + du_2 \\ \end{bmatrix} = \begin{bmatrix} cu_1 \\ cu_2 \\ \end{bmatrix} + \begin{bmatrix} du_1 \\ du_2 \\ \end{bmatrix} = cu + du \end{align*} $$

9. \(\forall u \in V, c, d \in \mathbb{R}, c(du) = (cd)u\)

$$ \begin{align*} c(du) &= c\left(d\begin{bmatrix} u_1 \\ u_2 \\ \end{bmatrix}\right) \\[10pt] &= c\begin{bmatrix} du_1 \\ du_2 \\ \end{bmatrix} \\[10pt] &= \begin{bmatrix} cdu_1 \\ cdu_2 \\ \end{bmatrix} = (cd)\begin{bmatrix} u_1 \\ u_2 \\ \end{bmatrix} = (cd)u \end{align*} $$

10. \(\forall u \in V, 1u = u\)

This one is straightforward: multiplying any vector by 1 yields the same vector back.
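
The four scalar-multiplication axioms above can all be checked numerically in one go (a NumPy sketch with arbitrary values):

```python
import numpy as np

u = np.array([1.0, 2.0])
v = np.array([-0.5, 3.0])
c, d = 2.0, -4.0

print(np.allclose(c * (u + v), c * u + c * v))  # axiom 7: distributivity over vector addition
print(np.allclose((c + d) * u, c * u + d * u))  # axiom 8: distributivity over scalar addition
print(np.allclose(c * (d * u), (c * d) * u))    # axiom 9: compatibility of scalar multiplications
print(np.array_equal(1 * u, u))                 # axiom 10: 1 is the identity scalar
```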

Exercise:

Now, I am 100% sure you can tell me whether the set below is a vector space, with justification. Leave your answer in the comments section 😉.

$$ \begin{align*} V = \{ \boldsymbol{x} : \boldsymbol{x} \in \mathbb{R}^{1024}, \: \boldsymbol{x} \neq \boldsymbol{0} \} \end{align*} $$

Linear Transformation

Now that we have described what vector spaces are, let’s talk about linear mappings or linear transformations. These are tools that permit us to go from one vector space like \(\mathbb{R}^2\) (2-dimensional real vectors) to another vector space such as \(\mathbb{R}^3\) (3-dimensional real vectors). Recall that we can’t do this using the operations \((+)\) and \((\cdot)\) defined in a vector space.

We will start by understanding what a transformation is with respect to vectors. Take a look at the function below:

$$ \begin{align*} f \colon &\mathcal{V} \to \mathcal{W}\\ &x \mapsto f(x) = x^2. \end{align*} $$

As you can see, the function \(f\) maps any value \(x \in \mathcal{V}\) to a corresponding value \(f(x) \in \mathcal{W}\). In essence, we say \(\mathcal{V}\) is the domain of definition of the function \(f\) and \(\mathcal{W}\) is called the co-domain. The domain is just the set of all possible inputs of a function, and the co-domain is the set that contains the corresponding output for every input present in the domain.

Now, let’s get back to vectors. Vector transformations do not map scalar values to one another but map a vector space to another vector space and, in some cases, to the same vector space (an endomorphism). The following transformation is a vector transformation:

$$ \begin{align*} f \colon & \: \mathbb{R}^2 \to \mathbb{R}^4\\ &x = \begin{bmatrix} x_1 \\ x_2 \\ \end{bmatrix} \mapsto f(x) = \begin{bmatrix} 0 \\ x_1 \\ x_2 \\ x_1 + x_2 \\ \end{bmatrix}. \end{align*} $$

It maps a 2-dimensional real vector \(x\) to a 4-dimensional vector \(f(x)\), and therefore we can say that \(\mathbb{R}^2\) is the domain and \(\mathbb{R}^4\) is the co-domain of the transformation. Simply speaking, a vector transformation is just a function that maps one vector to another such that its domain and co-domain are vector spaces. But what makes a vector transformation linear?
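
Before we answer that, here is the transformation above written as a small Python function (a sketch; the helper `f` simply mirrors the definition above):

```python
import numpy as np

def f(x):
    """Map a 2-d real vector [x1, x2] to the 4-d vector [0, x1, x2, x1 + x2]."""
    x1, x2 = x
    return np.array([0.0, x1, x2, x1 + x2])

x = np.array([3.0, -1.0])
print(f(x))        # the 4-d vector [0, 3, -1, 2]
print(f(x).shape)  # (4,): the output lives in R^4
```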

Let’s say we have two vector spaces \(\mathcal{V}, \mathcal{W}\) and a vector transformation \(\phi \colon \mathcal{V} \to \mathcal{W}\). \(\phi\) is linear if \(\forall x, y \in \mathcal{V}\) and \(\forall \lambda, \alpha \in \mathbb{R}\) the following condition is respected:

$$ \phi(\lambda x + \alpha y) = \lambda \phi (x) + \alpha \phi (y) $$

Yes, as simple as that. Now, let’s apply this rule to check if the vector transformation we defined earlier \(f\) is linear. If that’s the case then \(f(\lambda x + \alpha y) = \lambda f (x) + \alpha f (y)\). We will start by defining two vectors \(x\) and \(y\) such that:

$$ \begin{align*} x = \begin{bmatrix} x_1 \\ x_2 \\ \end{bmatrix} & & y = \begin{bmatrix} y_1 \\ y_2 \\ \end{bmatrix} \end{align*} $$

therefore,

$$ \begin{align*} f(\lambda x + \alpha y) = f\left(\lambda \begin{bmatrix} x_1 \\ x_2 \\ \end{bmatrix} + \alpha \begin{bmatrix} y_1 \\ y_2 \\ \end{bmatrix}\right) &= f\left( \begin{bmatrix} \lambda x_1 + \alpha y_1 \\ \lambda x_2 + \alpha y_2 \\ \end{bmatrix}\right) = \begin{bmatrix} 0 \\ \lambda x_1 + \alpha y_1 \\ \lambda x_2 + \alpha y_2 \\ \lambda x_1 + \alpha y_1 + \lambda x_2 + \alpha y_2 \\ \end{bmatrix} \end{align*} $$

Then, let’s try to go from this 4-d vector to \(\lambda f (x) + \alpha f (y)\)

$$ \begin{align*} \begin{bmatrix} 0 \\ \lambda x_1 + \alpha y_1 \\ \lambda x_2 + \alpha y_2 \\ \lambda x_1 + \alpha y_1 + \lambda x_2 + \alpha y_2 \\ \end{bmatrix} = \lambda\begin{bmatrix} 0 \\ x_1\\ x_2\\ x_1 + x_2\\ \end{bmatrix} + \alpha\begin{bmatrix} 0 \\ y_1 \\ y_2 \\ y_1 + y_2 \\ \end{bmatrix} &= \lambda f\left( \begin{bmatrix} x_1 \\ x_2\\ \end{bmatrix}\right) + \alpha f\left( \begin{bmatrix} y_1 \\ y_2 \\ \end{bmatrix}\right) \\ &= \lambda f (x) + \alpha f (y) \end{align*} $$
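
The same manipulation can be verified numerically for a couple of arbitrary vectors and scalars (a NumPy sketch, reusing the `f` defined above):

```python
import numpy as np

def f(x):
    x1, x2 = x
    return np.array([0.0, x1, x2, x1 + x2])

x = np.array([1.0, 2.0])
y = np.array([-3.0, 0.5])
lam, alpha = 2.0, -1.5

lhs = f(lam * x + alpha * y)     # f(λx + αy)
rhs = lam * f(x) + alpha * f(y)  # λf(x) + αf(y)
print(np.allclose(lhs, rhs))     # True: f respects the linearity condition on these inputs
```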

From the manipulations above, we are certain that \(f\) is a linear transformation. But there is an important property that linear transformations have. Did you know that some of them can be expressed as matrices? In particular, those linear transformations whose domain and co-domain are \(\mathbb{R}^n\) and \(\mathbb{R}^m\) respectively, for any \(n\) and \(m\). As a matter of fact, our linear transformation \(f\) can be expressed as:

$$ \begin{align*} f \colon & \: \mathbb{R}^2 \to \mathbb{R}^4\\ &\begin{bmatrix} x_1 \\ x_2 \\ \end{bmatrix} \mapsto \begin{bmatrix} 0 & 0 \\ 1 & 0 \\ 0 & 1 \\ 1 & 1 \\ \end{bmatrix}\begin{bmatrix} x_1 \\ x_2 \\ \end{bmatrix} \end{align*} $$

Feel free to verify that the matrix above is correct. The fact that we can represent some linear transformations of vectors as matrices is an essential notion in linear algebra. Therefore, our ability to decompose these matrices using SVD will come in handy in the next part of this series.
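
One way to do that verification is numerically (a NumPy sketch): the matrix-vector product should reproduce \(f(x)\).

```python
import numpy as np

# Matrix form of the linear transformation f
A = np.array([[0.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])

def f(x):
    x1, x2 = x
    return np.array([0.0, x1, x2, x1 + x2])

x = np.array([3.0, -1.0])
print(np.allclose(A @ x, f(x)))  # True: the matrix reproduces f
```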

Orthogonality

Matrices have various properties, but we will look at one very important concept for understanding SVD, namely, orthogonality.

Vectors

Before looking at orthogonality in matrices, let’s first look at this property in vectors, since a matrix is simply a collection of column (or row) vectors. Orthogonality is a relation that exists between two or more vectors. It is related to, but not the same as, the angle between two vectors.

Fig. 3. Angle θ between two vectors u and v.

$$ \cos\theta = \frac{u\cdot v}{\lVert u \rVert \lVert v \rVert} $$

Two vectors are orthogonal if \(\theta = 90^\circ\). In other words, two vectors are orthogonal if they are perpendicular. Therefore, we can verify that two vectors are orthogonal by checking that the following holds:

$$ \begin{align*} \frac{u\cdot v}{\lVert u \rVert \lVert v \rVert} = \cos 90^\circ = 0 && \textrm{or simply} && u\cdot v = 0 \end{align*} $$

Also, we call a set made up of these two vectors an orthonormal set if \(\lVert u \rVert = \lVert v \rVert = 1\).
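
For example, here is a quick NumPy check on two hand-picked perpendicular unit vectors (a sketch):

```python
import numpy as np

u = np.array([1.0, 0.0])
v = np.array([0.0, 1.0])

print(np.dot(u, v))                          # 0.0: u and v are orthogonal
print(np.linalg.norm(u), np.linalg.norm(v))  # 1.0 1.0: both have unit length
# Orthogonal vectors of unit length: {u, v} is an orthonormal set
```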

Matrices

An orthogonal matrix is just a real square matrix whose columns and rows each form an orthonormal set, that is, mutually perpendicular vectors each with a magnitude of 1. Let’s expand more on this with an example. We will prove that the matrix below is orthogonal.

$$ A = \frac{1}{3}\begin{bmatrix} 2 & -2 & 1 \\ 1 & 2 & 2 \\ 2 & 1 & -2 \\ \end{bmatrix} $$

The matrix has 3 rows and 3 columns, so the first condition is respected: the matrix is square. Now let’s check whether its row and column vectors each form an orthonormal set.

Rows

$$ \textrm{Rows} = \{ \frac{1}{3}\begin{bmatrix} 2 \\ -2 \\ 1 \end{bmatrix}, \: \frac{1}{3}\begin{bmatrix} 1 \\ 2 \\ 2 \end{bmatrix}, \: \frac{1}{3}\begin{bmatrix} 2 \\ 1 \\ -2 \end{bmatrix} \} $$

Let’s call these vectors \(r_1, r_2, r_3\) in the order listed above. Taking the dot products pairwise,

$$ \begin{align*} r_1 \cdot r_2 &= \tfrac{1}{9}\left(2 \cdot 1 + (-2) \cdot 2 + 1 \cdot 2\right) = \tfrac{1}{9}(2 - 4 + 2) = 0 \\ r_1 \cdot r_3 &= \tfrac{1}{9}\left(2 \cdot 2 + (-2) \cdot 1 + 1 \cdot (-2)\right) = \tfrac{1}{9}(4 - 2 - 2) = 0 \\ r_2 \cdot r_3 &= \tfrac{1}{9}\left(1 \cdot 2 + 2 \cdot 1 + 2 \cdot (-2)\right) = \tfrac{1}{9}(2 + 2 - 4) = 0 \end{align*} $$

so the row vectors are mutually orthogonal. Each of them also has squared components \(4, 4, 1\), so

$$ \lVert r_1 \rVert = \lVert r_2 \rVert = \lVert r_3 \rVert = \tfrac{1}{3}\sqrt{4 + 4 + 1} = \tfrac{1}{3}\sqrt{9} = 1 $$

and the rows therefore form an orthonormal set.

Columns

$$ \textrm{Columns} = \{ \frac{1}{3}\begin{bmatrix} 2 \\ 1 \\ 2 \end{bmatrix}, \: \frac{1}{3}\begin{bmatrix} -2 \\ 2 \\ 1 \end{bmatrix}, \: \frac{1}{3}\begin{bmatrix} 1 \\ 2 \\ -2 \end{bmatrix} \} $$

Similarly, calling these vectors \(c_1, c_2, c_3\),

$$ \begin{align*} c_1 \cdot c_2 &= \tfrac{1}{9}\left(2 \cdot (-2) + 1 \cdot 2 + 2 \cdot 1\right) = \tfrac{1}{9}(-4 + 2 + 2) = 0 \\ c_1 \cdot c_3 &= \tfrac{1}{9}\left(2 \cdot 1 + 1 \cdot 2 + 2 \cdot (-2)\right) = \tfrac{1}{9}(2 + 2 - 4) = 0 \\ c_2 \cdot c_3 &= \tfrac{1}{9}\left((-2) \cdot 1 + 2 \cdot 2 + 1 \cdot (-2)\right) = \tfrac{1}{9}(-2 + 4 - 2) = 0 \end{align*} $$

and each column also has squared components \(4, 4, 1\), so \(\lVert c_1 \rVert = \lVert c_2 \rVert = \lVert c_3 \rVert = 1\). The columns form an orthonormal set as well, which means that \(A\) is indeed an orthogonal matrix.
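
A compact way to double-check all of this numerically: a real square matrix is orthogonal exactly when \(AA^\top = A^\top A = I\). Here is a NumPy sketch of that check.

```python
import numpy as np

A = (1.0 / 3.0) * np.array([[2.0, -2.0,  1.0],
                            [1.0,  2.0,  2.0],
                            [2.0,  1.0, -2.0]])

I = np.eye(3)
print(np.allclose(A @ A.T, I))  # True: the rows form an orthonormal set
print(np.allclose(A.T @ A, I))  # True: the columns form an orthonormal set
```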

Conclusion

The objective of this post was to lay the foundational layers that will permit us to have a smooth introduction to SVD. If you think there is another concept that may be beneficial to understanding SVD better, please feel free to comment. In the next part of the series, we will look at SVD, its geometrical interpretation in vector spaces as well as its application to low-rank approximation (with some Python code this time 👨‍💻).