
Commit 3dbdccd

matheusvillasb, HumphreyYang, and mmcky authored
Suggestions for the SVD lecture (#361)
* Corrected some typos. Added a brief description explaining why the Eckart-Young theorem is important. Added details to how the PCA for any given matrix. Changed the n_components term in the codes to r_components, so it is in accord with the definition of the Eckward-Young theorem and doesn't create confusion with the n columns of the matrix. Added an exercise and its solution at the end (I've never used sphinx, so not sure if it's done correctly).

* Apply suggestions from code review

  Co-authored-by: Humphrey Yang <[email protected]>

* Update svd_intro.md

  Removed typos and excessive lines that were pointed out by HumphreyYang

* Update svd_intro.md

  just being sure I commited all changes

* Update svd_intro.md

* Update svd_intro.md

  corrected definition of pseudo inverse of sigma matrix in the exercise

* fix exercise solution syntax

---------

Co-authored-by: Humphrey Yang <[email protected]>
Co-authored-by: mmcky <[email protected]>
Co-authored-by: mmcky <[email protected]>
1 parent 17b87e1 commit 3dbdccd

File tree

1 file changed: +169 -39 lines


lectures/svd_intro.md

Lines changed: 169 additions & 39 deletions
@@ -80,9 +80,9 @@ $$
and

* $U$ is an $m \times m$ orthogonal matrix of **left singular vectors** of $X$
-* Columns of $U$ are eigenvectors of $X^\top X$
-* $V$ is an $n \times n$ orthogonal matrix of **right singular values** of $X$
-* Columns of $V$ are eigenvectors of $X X^\top $
+* Columns of $U$ are eigenvectors of $X X^\top $
+* $V$ is an $n \times n$ orthogonal matrix of **right singular vectors** of $X$
+* Columns of $V$ are eigenvectors of $X^\top X$

* $\Sigma$ is an $m \times n$ matrix in which the first $p$ places on its main diagonal are positive numbers $\sigma_1, \sigma_2, \ldots, \sigma_p$ called **singular values**; remaining entries of $\Sigma$ are all zero

* The $p$ singular values are positive square roots of the eigenvalues of the $m \times m$ matrix $X X^\top $ and also of the $n \times n$ matrix $X^\top X$
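
A quick numpy check of these eigenvector relationships, using a small random matrix purely as an illustration (the shapes and seed below are arbitrary, not taken from the lecture):

```python
import numpy as np

np.random.seed(0)
X = np.random.randn(5, 3)              # an m x n matrix with m=5, n=3

U, s, Vt = np.linalg.svd(X)            # full SVD: U is 5x5, Vt is 3x3

# singular values are the positive square roots of the eigenvalues of X^T X
eigvals = np.linalg.eigvalsh(X.T @ X)[::-1]             # descending order
print(np.allclose(s**2, eigvals))                       # True

# each column of V is an eigenvector of X^T X
V = Vt.T
print(np.allclose(X.T @ X @ V, V @ np.diag(eigvals)))   # True
```
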
@@ -99,10 +99,10 @@ The matrices $U,\Sigma,V$ entail linear transformations that reshape in vectors
* multiplying vectors by the diagonal matrix $\Sigma$ leaves **angles between vectors** unchanged but **rescales** vectors.

Thus, representation {eq}`eq:SVD101` asserts that multiplying an $n \times 1$ vector $y$ by the $m \times n$ matrix $X$
-amounts to performing the following three multiplcations of $y$ sequentially:
+amounts to performing the following three multiplications of $y$ sequentially:

* **rotating** $y$ by computing $V^\top y$
-* **rescaling** $V^\top y$ by multipying it by $\Sigma$
+* **rescaling** $V^\top y$ by multiplying it by $\Sigma$
* **rotating** $\Sigma V^\top y$ by multiplying it by $U$

This structure of the $m \times n$ matrix $X$ opens the door to constructing systems
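
The rotate-rescale-rotate reading of $Xy = U \Sigma V^\top y$ can be verified directly; a minimal sketch with an arbitrary random $X$ and $y$:

```python
import numpy as np

np.random.seed(1)
m, n = 4, 3
X = np.random.randn(m, n)
y = np.random.randn(n)

U, s, Vt = np.linalg.svd(X)            # full SVD
Sigma = np.zeros((m, n))
Sigma[:n, :n] = np.diag(s)             # embed the singular values in an m x n matrix

step1 = Vt @ y                         # rotate y
step2 = Sigma @ step1                  # rescale
step3 = U @ step2                      # rotate again
print(np.allclose(step3, X @ y))       # True
```
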
@@ -144,11 +144,11 @@ matrix $X$ of rank $p$.

* The **column space** of $X$, denoted ${\mathcal C}(X)$, is the span of the columns of $X$, i.e., all vectors $y$ that can be written as linear combinations of columns of $X$. Its dimension is $p$.
* The **null space** of $X$, denoted ${\mathcal N}(X)$ consists of all vectors $y$ that satisfy
-$X y = 0$. Its dimension is $m-p$.
+$X y = 0$. Its dimension is $n-p$.
* The **row space** of $X$, denoted ${\mathcal R}(X)$ is the column space of $X^\top $. It consists of all
vectors $z$ that can be written as linear combinations of rows of $X$. Its dimension is $p$.
* The **left null space** of $X$, denoted ${\mathcal N}(X^\top )$, consist of all vectors $z$ such that
-$X^\top z =0$. Its dimension is $n-p$.
+$X^\top z =0$. Its dimension is $m-p$.

For a full SVD of a matrix $X$, the matrix $U$ of left singular vectors and the matrix $V$ of right singular vectors contain orthogonal bases for all four subspaces.

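The corrected dimensions ($n-p$ for the null space, $m-p$ for the left null space) can be confirmed numerically; a small sketch, assuming `scipy` is available for `null_space`:

```python
import numpy as np
from scipy.linalg import null_space

np.random.seed(2)
m, n, p = 5, 4, 2
X = np.random.randn(m, p) @ np.random.randn(p, n)   # a rank-p, m x n matrix

print(np.linalg.matrix_rank(X))          # p = 2
print(null_space(X).shape[1])            # dim N(X)   = n - p = 2
print(null_space(X.T).shape[1])          # dim N(X^T) = m - p = 3
```
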
@@ -236,7 +236,7 @@ X^\top u_i & = 0 \quad i= p+1, \ldots, m
\end{aligned}
$$ (eq:orthoortho2)

-Notice how equations {eq}`eq:orthoortho2` assert that the transformation $X^\top $ maps a pairsof distinct orthonormal vectors $u_i, u_j$ for $i$ and $j$ both less than or equal to the rank $p$ of $X$ into a pair of distinct orthonormal vectors $v_i, v_j$ .
+Notice how equations {eq}`eq:orthoortho2` assert that the transformation $X^\top $ maps a pair of distinct orthonormal vectors $u_i, u_j$ for $i$ and $j$ both less than or equal to the rank $p$ of $X$ into a pair of distinct orthonormal vectors $v_i, v_j$ .


Equations {eq}`eq:Xfour1b` assert that

@@ -250,7 +250,7 @@ $$



-Thus, taken together, the systems of quations {eq}`eq:Xfour1a` and {eq}`eq:Xfour1b`
+Thus, taken together, the systems of equations {eq}`eq:Xfour1a` and {eq}`eq:Xfour1b`
describe the four fundamental subspaces of $X$ in the following ways:

$$

@@ -278,7 +278,7 @@ We have verified the four claims in {eq}`eq:fourspaceSVD` simply by performing
The claims in {eq}`eq:fourspaceSVD` and the fact that $U$ and $V$ are both unitary (i.e, orthonormal) matrices imply
that

-* the column space of $X$ is orthogonal to the null space of of $X^\top $
+* the column space of $X$ is orthogonal to the null space of $X^\top $
* the null space of $X$ is orthogonal to the row space of $X$

Sometimes these properties are described with the following two pairs of orthogonal complement subspaces:

@@ -339,7 +339,7 @@ print("Right null space:\n", null_space.T)

Suppose that we want to construct the best rank $r$ approximation of an $m \times n$ matrix $X$.

-By best we mean a matrix $X_r$ of rank $r < p$ that, among all rank $r$ matrices, minimizes
+By best, we mean a matrix $X_r$ of rank $r < p$ that, among all rank $r$ matrices, minimizes

$$ || X - X_r || $$

@@ -350,7 +350,7 @@ of dimension $m \times n$.

Three popular **matrix norms** of an $m \times n$ matrix $X$ can be expressed in terms of the singular values of $X$

-* the **spectral** or $l^2$ norm $|| X ||_2 = \max_{y \in \textbf{R}^n} \frac{||X y ||}{||y||} = \sigma_1$
+* the **spectral** or $l^2$ norm $|| X ||_2 = \max_{||y|| \neq 0} \frac{||X y ||}{||y||} = \sigma_1$
* the **Frobenius** norm $||X ||_F = \sqrt{\sigma_1^2 + \cdots + \sigma_p^2}$
* the **nuclear** norm $ || X ||_N = \sigma_1 + \cdots + \sigma_p $
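
All three norms can be recovered from the singular values; a brief numpy illustration with an arbitrary random matrix:

```python
import numpy as np

np.random.seed(3)
X = np.random.randn(4, 3)
s = np.linalg.svd(X, compute_uv=False)       # singular values only

print(np.isclose(np.linalg.norm(X, 2), s[0]))                        # spectral norm = sigma_1
print(np.isclose(np.linalg.norm(X, 'fro'), np.sqrt(np.sum(s**2))))   # Frobenius norm
print(np.isclose(np.linalg.norm(X, 'nuc'), np.sum(s)))               # nuclear norm
```
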

@@ -360,8 +360,13 @@ $$
\hat X_r = \sigma_1 U_1 V_1^\top + \sigma_2 U_2 V_2^\top + \cdots + \sigma_r U_r V_r^\top
$$ (eq:Ekart)

+This is a very powerful theorem: it says that we can take an $ m \times n $ matrix $X$ that is not full rank and construct its best rank $r$ approximation through the SVD.

-You can read about the Eckart-Young theorem and some of its uses here <https://en.wikipedia.org/wiki/Low-rank_approximation>.
+Moreover, if some of these $p$ singular values carry more information than others, and if we want to capture the most information with the least amount of data, we can keep the $r$ leading singular values, ordered by magnitude.
+
+We'll say more about this later when we present Principal Component Analysis.
+
+You can read about the Eckart-Young theorem and some of its uses [here](https://en.wikipedia.org/wiki/Low-rank_approximation).

We'll make use of this theorem when we discuss principal components analysis (PCA) and also dynamic mode decomposition (DMD).

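A minimal numpy sketch of the Eckart-Young idea: truncate the SVD at rank $r$ and compare the approximation errors with the discarded singular values (the matrix and $r$ below are arbitrary):

```python
import numpy as np

np.random.seed(4)
X = np.random.randn(6, 5)
U, s, Vt = np.linalg.svd(X, full_matrices=False)

r = 2
X_r = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]        # best rank-r approximation

# Frobenius error equals the root of the sum of the discarded squared singular values
print(np.isclose(np.linalg.norm(X - X_r, 'fro'), np.sqrt(np.sum(s[r:]**2))))
# spectral-norm error equals the largest discarded singular value
print(np.isclose(np.linalg.norm(X - X_r, 2), s[r]))
```
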
@@ -465,7 +470,7 @@ print(f'rank of X = {rr}')

**Properties:**

-* Where $U$ is constructed via a full SVD, $U^\top U = I_{p\times p}$ and $U U^\top = I_{m \times m}$
+* Where $U$ is constructed via a full SVD, $U^\top U = I_{m\times m}$ and $U U^\top = I_{m \times m}$
* Where $\hat U$ is constructed via a reduced SVD, although $\hat U^\top \hat U = I_{p\times p}$, it happens that $\hat U \hat U^\top \neq I_{m \times m}$

We illustrate these properties for our example with the following code cells.

@@ -490,10 +495,10 @@ UhatUhatT, UhatTUhat

**Remarks:**

-The cells above illustrate application of the `fullmatrices=True` and `full-matrices=False` options.
-Using `full-matrices=False` returns a reduced singular value decomposition.
+The cells above illustrate the application of the `full_matrices=True` and `full_matrices=False` options.
+Using `full_matrices=False` returns a reduced singular value decomposition.

-The **full** and **reduced** SVd's both accurately decompose an $m \times n$ matrix $X$
+The **full** and **reduced** SVD's both accurately decompose an $m \times n$ matrix $X$

When we study Dynamic Mode Decompositions below, it will be important for us to remember the preceding properties of full and reduced SVD's in such tall-skinny cases.
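
The corrected `full_matrices=True` / `full_matrices=False` options and the orthogonality properties above can be illustrated with a tall-skinny random matrix (this sketch is separate from the lecture's own code cells):

```python
import numpy as np

np.random.seed(5)
X = np.random.randn(5, 2)                                   # tall-skinny: m=5, n=2

U, s, Vt = np.linalg.svd(X, full_matrices=True)             # full SVD: U is 5x5
Uhat, _, _ = np.linalg.svd(X, full_matrices=False)          # reduced SVD: Uhat is 5x2

print(U.shape, Uhat.shape)                       # (5, 5) (5, 2)
print(np.allclose(U.T @ U, np.eye(5)))           # True for the full SVD
print(np.allclose(Uhat.T @ Uhat, np.eye(2)))     # True
print(np.allclose(Uhat @ Uhat.T, np.eye(5)))     # False: Uhat is not square
```
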
@@ -569,19 +574,90 @@ where for $j = 1, \ldots, n$ the column vector $X_j = \begin{bmatrix}X_{1j}\\X_{

In a **time series** setting, we would think of columns $j$ as indexing different __times__ at which random variables are observed, while rows index different random variables.

-In a **cross section** setting, we would think of columns $j$ as indexing different __individuals__ for which random variables are observed, while rows index different **attributes**.
+In a **cross-section** setting, we would think of columns $j$ as indexing different __individuals__ for which random variables are observed, while rows index different **attributes**.
+
+As we have seen before, the SVD is a way to decompose a matrix into useful components, just like polar decomposition, eigendecomposition, and many others.
+
+PCA, on the other hand, is a method that builds on the SVD to analyze data. The goal is to apply a sequence of steps that use statistical tools to capture, and help visualize, the most important patterns in the data.
+
+**Step 1: Standardize the data:**

-The number of positive singular values equals the rank of matrix $X$.
+Because our data matrix may hold variables of different units and scales, we first need to standardize the data.

-Arrange the singular values in decreasing order.
+We start by computing the average of each column of $X$:

-Arrange the positive singular values on the main diagonal of the matrix $\Sigma$ of into a vector $\sigma_R$.
+$$
+\bar{X_j}= \frac{1}{m} \sum_{i = 1}^{m} x_{i,j}
+$$
+
+We then create an average matrix out of these means:
+
+
+$$
+\bar{X} = \begin{bmatrix}1 \\1 \\\ldots\\1 \end{bmatrix} \begin{bmatrix} \bar{X_1} \mid \bar{X_2} \mid \cdots \mid \bar{X_n}\end{bmatrix}
+$$
+
+and subtract it from the original matrix to create a mean-centered matrix:
+
+$$
+B = X - \bar{X}
+$$
+
+
+**Step 2: Compute the covariance matrix:**
+
+Because we want to capture the relationships between variables rather than just their magnitudes, in other words, how they co-vary with one another, we compute the covariance matrix of $B$:
+
+$$
+C = \frac{1}{n} B^\top B
+$$
+
+**Step 3: Decompose the covariance matrix and arrange the singular values:**
+
+If the matrix $C$ is diagonalizable, we can eigendecompose it, find its eigenvalues, and rearrange the eigenvalue and eigenvector matrices in decreasing order.
+
+Equivalently, we can obtain the same decomposition from an SVD of $B$:
+
+$$
+\begin{align}
+B^\top B &= V \Sigma^\top U^\top U \Sigma V^\top \cr
+&= V \Sigma^\top \Sigma V^\top
+\end{align}
+$$
+
+$$
+C = \frac{1}{n} V \Sigma^\top \Sigma V^\top
+$$
+
+We can then rearrange the columns in the matrices $V$ and $\Sigma$ so that the singular values are in decreasing order.
+
+
+**Step 4: Select singular values, (optionally) truncate the rest:**
+
+We can now decide how many singular values to keep, based on how much variance we want to retain (e.g., 95% of the total variance).
+
+We can obtain this percentage by dividing the variance contained in the leading $r$ factors by the total variance:
+
+$$
+\frac{\sum_{i = 1}^{r} \sigma^2_{i}}{\sum_{i = 1}^{p} \sigma^2_{i}}
+$$
+
+**Step 5: Create the Score Matrix:**
+
+$$
+\begin{align}
+T &= BV \cr
+&= U\Sigma V^\top V \cr
+&= U\Sigma
+\end{align}
+$$

-Set all other entries of $\Sigma$ to zero.

## Relationship of PCA to SVD

-To relate a SVD to a PCA (principal component analysis) of data set $X$, first construct the SVD of the data matrix $X$:
+To relate an SVD to a PCA of data set $X$, first construct the SVD of the data matrix $X$:
+
+Let's assume that sample means of all variables are zero, so we don't need to standardize our matrix.

$$
X = U \Sigma V^\top = \sigma_1 U_1 V_1^\top + \sigma_2 U_2 V_2^\top + \cdots + \sigma_p U_p V_p^\top
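
A compact numpy sketch of the five steps above; for this illustration only, rows are taken to index observations and columns to index variables, and the covariance is divided by the number of observations:

```python
import numpy as np

np.random.seed(6)
m, n = 200, 3                   # m observations (rows), n variables (columns) in this sketch
X = np.random.randn(m, n)

# Step 1: mean-center each variable
B = X - X.mean(axis=0, keepdims=True)

# Step 2: covariance matrix of the centered data
C = B.T @ B / m

# Step 3: an SVD of B diagonalizes C, since B^T B = V (Sigma^T Sigma) V^T
U, s, Vt = np.linalg.svd(B, full_matrices=False)
print(np.allclose(C, Vt.T @ np.diag(s**2) @ Vt / m))     # True

# Step 4: share of variance retained by the r leading singular values
r = 2
print(np.sum(s[:r]**2) / np.sum(s**2))

# Step 5: score matrix T = B V = U Sigma (keeping r components)
T = B @ Vt.T[:, :r]
print(np.allclose(T, U[:, :r] * s[:r]))                  # True
```
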
@@ -763,7 +839,7 @@ $$
X X^\top = U \Sigma \Sigma^\top U^\top
$$

-where $U$ is an orthonal matrix.
+where $U$ is an orthogonal matrix.

Thus, $P = U$ and we have the representation of $X$

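That $P = U$ (up to the sign of each column) can be checked by comparing an eigendecomposition of $X X^\top$ with the left singular vectors; a small random example:

```python
import numpy as np

np.random.seed(7)
X = np.random.randn(3, 8)
U, s, Vt = np.linalg.svd(X, full_matrices=False)

eigvals, P = np.linalg.eigh(X @ X.T)     # eigh returns ascending order
eigvals, P = eigvals[::-1], P[:, ::-1]   # reverse to match the SVD ordering

print(np.allclose(eigvals, s**2))          # eigenvalues are squared singular values
print(np.allclose(np.abs(P), np.abs(U)))   # P matches U up to the sign of each column
```
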
@@ -791,9 +867,11 @@ Below we define a class `DecomAnalysis` that wraps PCA and SVD for a given a da
class DecomAnalysis:
    """
    A class for conducting PCA and SVD.
+    X: data matrix
+    r_component: chosen rank for best approximation
    """

-    def __init__(self, X, n_component=None):
+    def __init__(self, X, r_component=None):

        self.X = X

@@ -802,10 +880,10 @@ class DecomAnalysis:
        self.m, self.n = X.shape
        self.r = LA.matrix_rank(X)

-        if n_component:
-            self.n_component = n_component
+        if r_component:
+            self.r_component = r_component
        else:
-            self.n_component = self.m
+            self.r_component = self.m

    def pca(self):

@@ -825,8 +903,8 @@ class DecomAnalysis:
        # compute the N by T matrix of principal components
        self.𝜖 = self.P.T @ self.X

-        P = self.P[:, :self.n_component]
-        𝜖 = self.𝜖[:self.n_component, :]
+        P = self.P[:, :self.r_component]
+        𝜖 = self.𝜖[:self.r_component, :]

        # transform data
        self.X_pca = P @ 𝜖
@@ -854,26 +932,26 @@ class DecomAnalysis:
        self.explained_ratio_svd = np.cumsum(𝜎_sq) / 𝜎_sq.sum()

        # slicing matrices by the number of components to use
-        U = self.U[:, :self.n_component]
-        Σ = self.Σ[:self.n_component, :self.n_component]
-        VT = self.VT[:self.n_component, :]
+        U = self.U[:, :self.r_component]
+        Σ = self.Σ[:self.r_component, :self.r_component]
+        VT = self.VT[:self.r_component, :]

        # transform data
        self.X_svd = U @ Σ @ VT

-    def fit(self, n_component):
+    def fit(self, r_component):

        # pca
-        P = self.P[:, :n_component]
-        𝜖 = self.𝜖[:n_component, :]
+        P = self.P[:, :r_component]
+        𝜖 = self.𝜖[:r_component, :]

        # transform data
        self.X_pca = P @ 𝜖

        # svd
-        U = self.U[:, :n_component]
-        Σ = self.Σ[:n_component, :n_component]
-        VT = self.VT[:n_component, :]
+        U = self.U[:, :r_component]
+        Σ = self.Σ[:r_component, :r_component]
+        VT = self.VT[:r_component, :]

        # transform data
        self.X_svd = U @ Σ @ VT
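
Outside the class, the same `r_component` slicing just keeps the leading columns and rows of the SVD factors; a standalone numpy sketch of that pattern:

```python
import numpy as np

np.random.seed(8)
X = np.random.randn(5, 40)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
Σ = np.diag(s)

r_component = 2
U_r = U[:, :r_component]                     # same slicing pattern as in the class
Σ_r = Σ[:r_component, :r_component]
VT_r = Vt[:r_component, :]

X_svd = U_r @ Σ_r @ VT_r                     # rank-r reconstruction of X
print(np.linalg.matrix_rank(X_svd))          # 2
```
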
@@ -926,6 +1004,58 @@ def compare_pca_svd(da):
    plt.show()
```

+## Exercises
+
+```{exercise}
+:label: svd_ex1
+
+In Ordinary Least Squares (OLS), we learn to compute $ \hat{\beta} = (X^\top X)^{-1} X^\top y $, but there are cases, such as when we have collinearity or an underdetermined system (a **short fat** matrix), where this formula cannot be applied.
+
+In these cases, the $ (X^\top X) $ matrix is either not invertible (its determinant is zero) or ill-conditioned (its determinant is very close to zero).
+
+What we can do instead is to create what is called a [pseudoinverse](https://en.wikipedia.org/wiki/Moore%E2%80%93Penrose_inverse), a generalization of the inverse matrix, and use it to compute $ \hat{\beta} $.
+
+Thinking in terms of the Eckart-Young theorem, build the pseudoinverse matrix $ X^{+} $ and use it to compute $ \hat{\beta} $.
+
+```
+
+```{solution-start} svd_ex1
+:class: dropdown
+```
+
+We can use the SVD to compute the pseudoinverse:
+
+$$
+X = U \Sigma V^\top
+$$
+
+taking the pseudoinverse of $X$, we have:
+
+$$
+X^{+} = V \Sigma^{+} U^\top
+$$
+
+where:
+
+$$
+\Sigma^{+} = \begin{bmatrix}
+\frac{1}{\sigma_1} & 0 & \cdots & 0 & 0 \\
+0 & \frac{1}{\sigma_2} & \cdots & 0 & 0 \\
+\vdots & \vdots & \ddots & \vdots & \vdots \\
+0 & 0 & \cdots & \frac{1}{\sigma_p} & 0 \\
+0 & 0 & \cdots & 0 & 0 \\
+\end{bmatrix}
+$$
+
+and finally:
+
+$$
+\hat{\beta} = X^{+}y = V \Sigma^{+} U^\top y
+$$
+
+```{solution-end}
+```
+

For an example PCA applied to analyzing the structure of intelligence tests see this lecture {doc}`Multivariable Normal Distribution <multivariate_normal>`.
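The solution above can be checked numerically; a sketch that builds $X^{+}$ from a truncated $\Sigma^{+}$ for a deliberately collinear design matrix (the data, tolerance, and variable names here are illustrative only):

```python
import numpy as np

np.random.seed(9)
n_obs = 50
X = np.random.randn(n_obs, 3)
X = np.column_stack([X, X[:, 0] + X[:, 1]])      # add a collinear column, so X^T X is singular
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + 0.1 * np.random.randn(n_obs)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
s_plus = np.array([1/sigma if sigma > 1e-10 else 0.0 for sigma in s])  # invert only nonzero singular values
X_pinv = Vt.T @ np.diag(s_plus) @ U.T            # X^+ = V Sigma^+ U^T

beta_hat = X_pinv @ y
print(np.allclose(X @ X_pinv @ X, X))            # Moore-Penrose identity X X^+ X = X
print(np.allclose(X.T @ X @ beta_hat, X.T @ y))  # beta_hat satisfies the normal equations
```
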
