Suggestions for the SVD lecture #361
Conversation
…kart-Young theorem is important. Added details on how to compute the PCA for any given matrix. Changed the n_components term in the code to r_components, so it is in accord with the definition of the Eckart-Young theorem and doesn't create confusion with the n columns of the matrix. Added an exercise and its solution at the end (I've never used Sphinx, so not sure if it's done correctly).
Many thanks @matheusvillasb for your very helpful corrections and suggestions! @HumphreyYang, could you please do a first-round review of this PR? @matheusvillasb is a first-time contributor, and hence might need some assistance with QE conventions and MyST syntax. @thomassargent30, just looping you into this discussion since the edits are to your SVD lecture. @matheusvillasb is a very enthusiastic and hard-working masters student based in Europe.
Great changes @matheusvillasb, many thanks.
Please kindly check my comments below and feel free to commit them. I will do another round in a separate PR once these changes are made.
Co-authored-by: Humphrey Yang <[email protected]>
Removed typos and excessive lines that were pointed out by HumphreyYang
just being sure I committed all changes
Thank you so much, Humphrey!
Thanks so much for reviewing everything I did and making it more concise! I committed the changes; there was just the third suggestion, where you suggested I write "the leading eigenvalues", but in that particular case the theorem works for any matrix, not only those that can be eigendecomposed, so "singular values" is more appropriate. Other than that, I think you made everything much better, thanks!
So as I said, I committed the changes and also made a new pull request; I hope this works.
I'm sorry if I did anything wrong, I'm still figuring out how to do it.
Thank you very much! If you don't mind, I'll start proofreading the DMD lectures and proposing exercises and small changes. Or if you need help with anything else, I'm just really glad to help!
Matheus
On Thu, 17 Aug 2023 at 08:38, Humphrey Yang ***@***.***> wrote:
… ***@***.**** requested changes on this pull request.
Great changes @matheusvillasb <https://github.com/matheusvillasb>,
Please kindly check my comments above and feel free to commit them. I will
do another round in a separate PR once these changes are made.
------------------------------
In lectures/svd_intro.md
<#361 (comment)>
:
> @@ -350,7 +350,7 @@ of dimension $m \times n$.
Three popular **matrix norms** of an $m \times n$ matrix $X$ can be expressed in terms of the singular values of $X$
-* the **spectral** or $l^2$ norm $|| X ||_2 = \max_{y \in \textbf{R}^n} \frac{||X y ||}{||y||} = \sigma_1$
+* the **spectral** or $l^2$ norm $|| X ||_2 = \max_{y \neq 0} \frac{||X y ||}{||y||} = \sigma_1$
⬇️ Suggested change
-* the **spectral** or $l^2$ norm $|| X ||_2 = \max_{y \neq 0} \frac{||X y ||}{||y||} = \sigma_1$
+* the **spectral** or $l^2$ norm $|| X ||_2 = \max_{||y|| \neq 0} \frac{||X y ||}{||y||} = \sigma_1$
Nice catch! I think it is clearer if we say $||y|| \neq 0$ instead of $y
\neq 0$.
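(As a quick aside, not part of the lecture text: a minimal NumPy sketch checking that this maximum, the spectral norm, equals the leading singular value $\sigma_1$; the matrix is made up for illustration.)

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 3))      # arbitrary m x n matrix, for illustration only

# approximate max ||Xy|| / ||y|| over nonzero y by sampling many random directions
ys = rng.standard_normal((3, 100_000))
ratios = np.linalg.norm(X @ ys, axis=0) / np.linalg.norm(ys, axis=0)

sigma_1 = np.linalg.svd(X, compute_uv=False)[0]      # leading singular value
print(ratios.max(), sigma_1, np.linalg.norm(X, 2))   # all (approximately) equal
```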
------------------------------
In lectures/svd_intro.md
<#361 (comment)>
:
> @@ -360,6 +360,13 @@ $$
\hat X_r = \sigma_1 U_1 V_1^\top + \sigma_2 U_2 V_2^\top + \cdots + \sigma_r U_r V_r^\top
$$ (eq:Ekart)
+This is a very powerful theorem, it says that we can take our $ m \times n $ matrix $X$ that in not full rank, and we can best approximate it to a full rank $p_x p$ matrix through the SVD.
⬇️ Suggested change
-This is a very powerful theorem, it says that we can take our $ m \times n $ matrix $X$ that in not full rank, and we can best approximate it to a full rank $p_x p$ matrix through the SVD.
+This is a very powerful theorem, it says that we can take our $ m \times n $ matrix $X$ that in not full rank, and we can best approximate it to a full rank $p \times p$ matrix through the SVD.
I think you are suggesting $p \times p$ here.
------------------------------
In lectures/svd_intro.md
<#361 (comment)>
:
> @@ -360,6 +360,13 @@ $$
\hat X_r = \sigma_1 U_1 V_1^\top + \sigma_2 U_2 V_2^\top + \cdots + \sigma_r U_r V_r^\top
$$ (eq:Ekart)
+This is a very powerful theorem, it says that we can take our $ m \times n $ matrix $X$ that in not full rank, and we can best approximate it to a full rank $p_x p$ matrix through the SVD.
+
+Moreover, if some of these $p$ singular values carry more information than others, and if we want to have the most amount of information with the least amount of data, we can order these singular values in a decreasing order by magnitude and set a threshold $r$, from where past this point we set all singular values to zero.
This is a great addition, but we often make sentences short in lectures. I
propose we shorten it as
⬇️ Suggested change
-Moreover, if some of these $p$ singular values carry more information than others, and if we want to have the most amount of information with the least amount of data, we can order these singular values in a decreasing order by magnitude and set a threshold $r$, from where past this point we set all singular values to zero.
+Moreover, if some of these $p$ singular values carry more information than others, and if we want to have the most amount of information with the least amount of data, we can take $r$ leading eigenvalues ordered by magnitude.
Please feel free to take or leave this suggestion.
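(Side note, just to illustrate what taking the $r$ leading singular values looks like in code: a minimal NumPy sketch of the rank-$r$ truncation in the Eckart-Young statement above; the matrix and the choice of $r$ are made up.)

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((8, 6))                   # arbitrary m x n data matrix
U, S, Vt = np.linalg.svd(X, full_matrices=False)  # singular values come back in decreasing order

r = 3                                             # illustrative truncation threshold
X_r = U[:, :r] @ np.diag(S[:r]) @ Vt[:r, :]       # best rank-r approximation of X

# the Frobenius error of the best rank-r approximation equals sqrt(sum of discarded sigma_i^2)
print(np.linalg.norm(X - X_r, 'fro'), np.sqrt(np.sum(S[r:]**2)))
```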
------------------------------
In lectures/svd_intro.md
<#361 (comment)>
:
> @@ -360,6 +360,13 @@ $$
\hat X_r = \sigma_1 U_1 V_1^\top + \sigma_2 U_2 V_2^\top + \cdots + \sigma_r U_r V_r^\top
$$ (eq:Ekart)
+This is a very powerful theorem, it says that we can take our $ m \times n $ matrix $X$ that in not full rank, and we can best approximate it to a full rank $p_x p$ matrix through the SVD.
+
+Moreover, if some of these $p$ singular values carry more information than others, and if we want to have the most amount of information with the least amount of data, we can order these singular values in a decreasing order by magnitude and set a threshold $r$, from where past this point we set all singular values to zero.
+
+This is what model reduction is about, we project the data into a new space, where we extract the patterns that are behind this data, and then we can then keep most of the important patterns and truncate the rest.
I think the previous sentence is very clear already. Would you consider leaving this out for simplicity?
------------------------------
In lectures/svd_intro.md
<#361 (comment)>
:
> @@ -569,19 +576,78 @@ where for $j = 1, \ldots, n$ the column vector $X_j = \begin{bmatrix}X_{1j}\\X_{
In a **time series** setting, we would think of columns $j$ as indexing different __times__ at which random variables are observed, while rows index different random variables.
-In a **cross section** setting, we would think of columns $j$ as indexing different __individuals__ for which random variables are observed, while rows index different **attributes**.
+In a **cross-section** setting, we would think of columns $j$ as indexing different __individuals__ for which random variables are observed, while rows index different **attributes**.
+
+As we have seen before, the SVD is a way to decompose a matrix into useful components, just like polar decomposition, eigen decomposition and many others. PCA on the other hand, is method that builds on the SVD, to analyse data. The goal is to apply certain steps, to help better visualize patterns in data, using statistical tools to capture the most important patterns in data.
⬇️ Suggested change
-As we have seen before, the SVD is a way to decompose a matrix into useful components, just like polar decomposition, eigen decomposition and many others. PCA on the other hand, is method that builds on the SVD, to analyse data. The goal is to apply certain steps, to help better visualize patterns in data, using statistical tools to capture the most important patterns in data.
+As we have seen before, the SVD is a way to decompose a matrix into useful components, just like polar decomposition, eigendecomposition, and many others.
+
+PCA, on the other hand, is a method that builds on the SVD to analyze data. The goal is to apply certain steps, to help better visualize patterns in data, using statistical tools to capture the most important patterns in data.
------------------------------
In lectures/svd_intro.md
<#361 (comment)>
:
> @@ -360,6 +360,13 @@ $$
\hat X_r = \sigma_1 U_1 V_1^\top + \sigma_2 U_2 V_2^\top + \cdots + \sigma_r U_r V_r^\top
$$ (eq:Ekart)
+This is a very powerful theorem, it says that we can take our $ m \times n $ matrix $X$ that in not full rank, and we can best approximate it to a full rank $p_x p$ matrix through the SVD.
+
+Moreover, if some of these $p$ singular values carry more information than others, and if we want to have the most amount of information with the least amount of data, we can order these singular values in a decreasing order by magnitude and set a threshold $r$, from where past this point we set all singular values to zero.
+
+This is what model reduction is about, we project the data into a new space, where we extract the patterns that are behind this data, and then we can then keep most of the important patterns and truncate the rest.
+
+But more about it later when we present Principal Component Analysis.
You can read about the Eckart-Young theorem and some of its uses here <https://en.wikipedia.org/wiki/Low-rank_approximation>.
⬇️ Suggested change
-You can read about the Eckart-Young theorem and some of its uses here <https://en.wikipedia.org/wiki/Low-rank_approximation>.
+You can read about the Eckart-Young theorem and some of its uses [here](https://en.wikipedia.org/wiki/Low-rank_approximation).
------------------------------
In lectures/svd_intro.md
<#361 (comment)>
:
> The cells above illustrate application of the `fullmatrices=True` and `full-matrices=False` options.
Using `full-matrices=False` returns a reduced singular value decomposition.
⬇️ Suggested change
-The cells above illustrate application of the `fullmatrices=True` and `full-matrices=False` options.
-Using `full-matrices=False` returns a reduced singular value decomposition.
+The cells above illustrate the application of the `full_matrices=True` and `full_matrices=False` options.
+Using `full_matrices=False` returns a reduced singular value decomposition.
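(For reference, a minimal sketch of how `full_matrices` changes what `np.linalg.svd` returns; the matrix dimensions are made up.)

```python
import numpy as np

X = np.random.default_rng(2).standard_normal((5, 2))   # m > n, a "tall skinny" example

U, S, Vt = np.linalg.svd(X, full_matrices=True)         # full SVD
print(U.shape, S.shape, Vt.shape)                       # (5, 5) (2,) (2, 2)

Uh, Sh, Vth = np.linalg.svd(X, full_matrices=False)     # reduced SVD
print(Uh.shape, Sh.shape, Vth.shape)                    # (5, 2) (2,) (2, 2)
```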
------------------------------
In lectures/svd_intro.md
<#361 (comment)>
:
> @@ -569,19 +576,78 @@ where for $j = 1, \ldots, n$ the column vector $X_j = \begin{bmatrix}X_{1j}\\X_{
In a **time series** setting, we would think of columns $j$ as indexing different __times__ at which random variables are observed, while rows index different random variables.
-In a **cross section** setting, we would think of columns $j$ as indexing different __individuals__ for which random variables are observed, while rows index different **attributes**.
+In a **cross-section** setting, we would think of columns $j$ as indexing different __individuals__ for which random variables are observed, while rows index different **attributes**.
+
+As we have seen before, the SVD is a way to decompose a matrix into useful components, just like polar decomposition, eigen decomposition and many others. PCA on the other hand, is method that builds on the SVD, to analyse data. The goal is to apply certain steps, to help better visualize patterns in data, using statistical tools to capture the most important patterns in data.
+
+**Step 1: Standardize the data:** Because our data matrix may hold variables of different units and scales like mentioned above, we first need to standardize the data. First by computing the average of each row of $X$.
⬇️ Suggested change
-**Step 1: Standardize the data:** Because our data matrix may hold variables of different units and scales like mentioned above, we first need to standardize the data. First by computing the average of each row of $X$.
+**Step 1: Standardize the data:**
+
+Because our data matrix may hold variables of different units and scales, we first need to standardize the data.
+
+First by computing the average of each row of $X$.
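(Not part of the suggestion itself: a minimal NumPy sketch of this standardization step, assuming the lecture's convention that rows of $X$ are variables and columns are observations; the data are made up.)

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((4, 10))             # 4 variables (rows), 10 observations (columns)

row_means = X.mean(axis=1, keepdims=True)    # average of each row of X
B = X - row_means                            # demeaned data matrix

# if the variables have very different scales, one can also divide by each row's std
B_scaled = B / X.std(axis=1, keepdims=True)
print(B.mean(axis=1))                        # approximately zero for every row
```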
------------------------------
In lectures/svd_intro.md
<#361 (comment)>
:
>
-The number of positive singular values equals the rank of matrix $X$.
+**Step 2: Compute the covariance matrix:** Then because we want to extract the relationships between variables rather than just their magnitude, in other words, we want to know how they can explain each other, we compute the covariance matrix of $B$.
⬇️ Suggested change
-**Step 2: Compute the covariance matrix:** Then because we want to extract the relationships between variables rather than just their magnitude, in other words, we want to know how they can explain each other, we compute the covariance matrix of $B$.
+**Step 2: Compute the covariance matrix:**
+
+Then because we want to extract the relationships between variables rather than just their magnitude, in other words, we want to know how they can explain each other, we compute the covariance matrix of $B$.
------------------------------
In lectures/svd_intro.md
<#361 (comment)>
:
>
-The number of positive singular values equals the rank of matrix $X$.
+**Step 2: Compute the covariance matrix:** Then because we want to extract the relationships between variables rather than just their magnitude, in other words, we want to know how they can explain each other, we compute the covariance matrix of $B$.
+
+$$
+C = \frac{1}{{n}} B^T B
⬇️ Suggested change
-C = \frac{1}{{n}} B^T B
+C = \frac{1}{{n}} B^\top B
------------------------------
In lectures/svd_intro.md
<#361 (comment)>
:
>
-The number of positive singular values equals the rank of matrix $X$.
+**Step 2: Compute the covariance matrix:** Then because we want to extract the relationships between variables rather than just their magnitude, in other words, we want to know how they can explain each other, we compute the covariance matrix of $B$.
+
+$$
+C = \frac{1}{{n}} B^T B
+$$
+
+**Step 3: Decompose the covariance matrix and arrange the singular values:**
+
+If the matrix $C$ is diagonalizable, we can eigendecompose it, find its eigenvalues and rearrange the eigenvalue and eigenvector matrices in a decreasing other. If $C$ is not diagonalizable, we can perform an SVD of $C$:
⬇️ Suggested change
-If the matrix $C$ is diagonalizable, we can eigendecompose it, find its eigenvalues and rearrange the eigenvalue and eigenvector matrices in a decreasing other. If $C$ is not diagonalizable, we can perform an SVD of $C$:
+If the matrix $C$ is diagonalizable, we can eigendecompose it, find its eigenvalues and rearrange the eigenvalue and eigenvector matrices in a decreasing other.
+
+If $C$ is not diagonalizable, we can perform an SVD of $C$:
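(Not part of the suggestion above: an illustrative NumPy sketch of Steps 2 and 3, following the $C = \frac{1}{n} B^\top B$ formula as written; variable names and data are made up.)

```python
import numpy as np

rng = np.random.default_rng(4)
B = rng.standard_normal((4, 10))      # demeaned data matrix from Step 1 (illustrative)
n = B.shape[1]

C = B.T @ B / n                       # covariance matrix as defined in the text above

# C is symmetric positive semi-definite, so here an eigendecomposition exists;
# np.linalg.eigh returns eigenvalues in ascending order, so reverse to get decreasing order
eigvals, eigvecs = np.linalg.eigh(C)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# an SVD of C carries the same information, since C is symmetric PSD
S = np.linalg.svd(C, compute_uv=False)
print(np.allclose(S, eigvals))        # True (up to numerical error)
```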
------------------------------
In lectures/svd_intro.md
<#361 (comment)>
:
>
## Relationship of PCA to SVD
-To relate a SVD to a PCA (principal component analysis) of data set $X$, first construct the SVD of the data matrix $X$:
+To relate an SVD to a PCA (principal component analysis) of data set $X$, first construct the SVD of the data matrix $X$:
⬇️ Suggested change
-To relate an SVD to a PCA (principal component analysis) of data set $X$, first construct the SVD of the data matrix $X$:
+To relate an SVD to a PCA of data set $X$, first construct the SVD of the data matrix $X$:
------------------------------
In lectures/svd_intro.md
<#361 (comment)>
:
> +```{code-cell} python3
+
+We can use SVD to compute the pseudoinverse:
+
+$$
+X = U \Sigma V^\top
+$$
+
+inverting $X$, we have:
+
+$$
+X^{+} = V \Sigma^{+} U^\top
+$$
+
+where:
+
+$$
+\Sigma^{+} \Sigma = \begin{bmatrix} I_p & 0 \cr 0 & 0 \end{bmatrix}
+$$
+
+and finally:
+
+$$
+\hat{\beta} = X^{+}y = V \Sigma^{+} U^\top y
+$$
+
I think this should not be in a code block.
Please remove
```{code-cell} python3
...
```
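(Once the block above is turned back into plain math text, a small NumPy sketch of the same construction may help as a cross-check: build $X^{+} = V \Sigma^{+} U^\top$ from the SVD and compare it with `np.linalg.pinv`; the data are made up.)

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((3, 6))                  # "short fat" matrix: more columns than rows
y = rng.standard_normal(3)

U, S, Vt = np.linalg.svd(X, full_matrices=False)
Sigma_plus = np.diag(1.0 / S)                    # invert the positive singular values
X_pinv = Vt.T @ Sigma_plus @ U.T                 # X^+ = V Sigma^+ U^T (X has full row rank here)

beta_hat = X_pinv @ y                            # minimum-norm least-squares solution
print(np.allclose(X_pinv, np.linalg.pinv(X)))    # True
print(np.allclose(X @ beta_hat, y))              # True: the underdetermined system is solved exactly
```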
------------------------------
In lectures/svd_intro.md
<#361 (comment)>
:
>
-Arrange the positive singular values on the main diagonal of the matrix $\Sigma$ of into a vector $\sigma_R$.
+**Step 4: Select singular values, (optional) truncate the rest:**
+
+We can now decide how many singular values to pick, based on how much variance you want to retain. (e.g., retaining 95% of the total variance).
+
+$$
+\frac{\sum_{i = 1}^{r} \sigma^2_{i}}{\sum_{i = 1}^{p} \sigma^2_{i}}
+$$
+
+**Step 5: Create the Score Matrix:
⬇️ Suggested change
-**Step 5: Create the Score Matrix:
+**Step 5: Create the Score Matrix:**
------------------------------
In lectures/svd_intro.md
<#361 (comment)>
:
>
-Arrange the positive singular values on the main diagonal of the matrix $\Sigma$ of into a vector $\sigma_R$.
+**Step 4: Select singular values, (optional) truncate the rest:**
+
+We can now decide how many singular values to pick, based on how much variance you want to retain. (e.g., retaining 95% of the total variance).
⬇️ Suggested change
-We can now decide how many singular values to pick, based on how much variance you want to retain. (e.g., retaining 95% of the total variance).
+We can now decide how many singular values to pick, based on how much variance you want to retain. (e.g., retaining 95% of the total variance).
+
+We can obtain the percentage by calculating the variance contained in the leading $r$ factors divided by the variance in total:
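(As an aside, a minimal sketch of computing that ratio and picking the smallest $r$ that retains, say, 95% of the variance; the singular values below are made up.)

```python
import numpy as np

sigma = np.array([5.0, 3.0, 1.5, 0.5, 0.1])           # singular values in decreasing order (made up)

explained = np.cumsum(sigma**2) / np.sum(sigma**2)    # cumulative share of total variance
r = int(np.searchsorted(explained, 0.95)) + 1         # smallest r with at least 95% retained
print(explained, r)                                   # r = 3 for these values
```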
------------------------------
In lectures/svd_intro.md
<#361 (comment)>
:
> @@ -926,6 +994,56 @@ def compare_pca_svd(da):
plt.show()
```
+## Exercises
+
+```{exercise}
+:label: svd_ex1
+
+In Ordinary Least Squares (OLS), we learn to compute $ \hat{\beta} = (X^T X)^{-1} X^T y $, but there are cases such as when we have colinearity or an underdetermined system: **short fat** matrix.
+
+In these cases, the $ (X^T X) $ matrix is not inversible. Its determinant is zero or close to zero and we cannot invert it.
⬇️ Suggested change
-In these cases, the $ (X^T X) $ matrix is not inversible. Its determinant is zero or close to zero and we cannot invert it.
+In these cases, the $ (X^\top X) $ matrix is not inversible. Its determinant is zero or close to zero and we cannot invert it.
------------------------------
In lectures/svd_intro.md
<#361 (comment)>
:
> @@ -926,6 +994,56 @@ def compare_pca_svd(da):
plt.show()
```
+## Exercises
+
+```{exercise}
+:label: svd_ex1
+
+In Ordinary Least Squares (OLS), we learn to compute $ \hat{\beta} = (X^T X)^{-1} X^T y $, but there are cases such as when we have colinearity or an underdetermined system: **short fat** matrix.
⬇️ Suggested change
-In Ordinary Least Squares (OLS), we learn to compute $ \hat{\beta} = (X^T X)^{-1} X^T y $, but there are cases such as when we have colinearity or an underdetermined system: **short fat** matrix.
+In Ordinary Least Squares (OLS), we learn to compute $ \hat{\beta} = (X^\top X)^{-1} X^\top y $, but there are cases such as when we have colinearity or an underdetermined system: **short fat** matrix.
------------------------------
In lectures/svd_intro.md
<#361 (comment)>
:
> @@ -926,6 +994,56 @@ def compare_pca_svd(da):
plt.show()
```
+## Exercises
+
+```{exercise}
+:label: svd_ex1
+
+In Ordinary Least Squares (OLS), we learn to compute $ \hat{\beta} = (X^T X)^{-1} X^T y $, but there are cases such as when we have colinearity or an underdetermined system: **short fat** matrix.
+
+In these cases, the $ (X^T X) $ matrix is not inversible. Its determinant is zero or close to zero and we cannot invert it.
+
+What we can do instead is to create what is called a pseudoinverse, a full rank approximation of the inverted matrix so we can compute $ \hat{\beta} $ with it.
⬇️ Suggested change
-What we can do instead is to create what is called a pseudoinverse, a full rank approximation of the inverted matrix so we can compute $ \hat{\beta} $ with it.
+What we can do instead is to create what is called a [pseudoinverse](https://en.wikipedia.org/wiki/Moore%E2%80%93Penrose_inverse), a full rank approximation of the inverted matrix so we can compute $ \hat{\beta} $ with it.
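(To make the exercise concrete, a hedged NumPy sketch, with made-up data, of the situation it describes: $X^\top X$ is singular for a short fat $X$, but the pseudoinverse still delivers a $\hat{\beta}$.)

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((4, 8))             # underdetermined: 4 observations, 8 regressors
y = rng.standard_normal(4)

# X.T @ X is 8 x 8 but has rank at most 4, so it cannot be inverted
print(np.linalg.matrix_rank(X.T @ X))       # 4

beta_hat = np.linalg.pinv(X) @ y            # pseudoinverse-based (minimum-norm) solution
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))    # True
```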
Hi Matheus,
You are absolutely right. That was a typo from me.
No, you are doing great. Thanks for the great contribution. Please feel free to go ahead with the DMD lecture in a separate PR. Please do not hesitate to let me know if you need any help. Thanks,
lectures/svd_intro.md (Outdated)
@@ -360,8 +360,13 @@ $$
\hat X_r = \sigma_1 U_1 V_1^\top + \sigma_2 U_2 V_2^\top + \cdots + \sigma_r U_r V_r^\top
$$ (eq:Ekart)

This is a very powerful theorem, it says that we can take our $ m \times n $ matrix $X$ that in not full rank, and we can best approximate it to a full rank $p \times p$ matrix through the SVD.
Please change as follows:
"...powerful theorem that says..."
"...approximate it by a..."
lectures/svd_intro.md (Outdated)
You can read about the Eckart-Young theorem and some of its uses here <https://en.wikipedia.org/wiki/Low-rank_approximation>.
Moreover, if some of these $p$ singular values carry more information than others, and if we want to have the most amount of information with the least amount of data, we can take $r$ leading singular values ordered by magnitude.

But more about it later when we present Principal Component Analysis.
"We'll say more about this later when..."
lectures/svd_intro.md (Outdated)
In Ordinary Least Squares (OLS), we learn to compute $ \hat{\beta} = (X^\top X)^{-1} X^\top y $, but there are cases such as when we have colinearity or an underdetermined system: **short fat** matrix.

In these cases, the $ (X^\top X) $ matrix is not inversible. Its determinant is zero or close to zero and we cannot invert it.
"...not invertible (its determinant is zero) or ill-conditioned (its determinant is very close to zero)."
Thanks @matheusvillasb, these are nice changes. Thoughtful and well written. I've requested some very minor edits. Would you mind making those changes and pushing them to this PR? Once you have made those edits to the PR I'll pass the review over to @mmcky and @thomassargent30 so we can get this merged. (Thanks also to @HumphreyYang for a very useful review.)
Made the changes John asked for. Sorry for the delay, I thought I had committed them before, but the commit didn't go through.
Great, thanks @matheusvillasb. Over to you @mmcky and @thomassargent30.
corrected definition of pseudo inverse of sigma matrix in the exercise
@HumphreyYang, could you please look and see why this is failing?
Hi @jstac, I think @mmcky raised this GitHub Actions issue: it rejects running PRs from a fork, as it is reluctant to share the EC2 credentials. I think we have yet to find a good solution other than building it locally (CC @mmcky).
@HumphreyYang can you run this locally, cross-check the builds, and report back? It would be great if you could also fix the merge conflict.
Alternatively, please transfer the PR to a local branch and push (giving @matheusvillasb the credit to get
Closing this in preference for #375 (migrated to a local branch).
I have re-opened this as we can use #375 as a test environment, so long as we move those changes back to this PR (which can't execute the previews). @HumphreyYang makes a good point that we want attribution to be retained for @matheusvillasb, so it will be better to merge this PR. @HumphreyYang can you use
to bring this PR (fork) in line with #375?
@HumphreyYang I see the fix 0770ded and I can apply this if you're busy today. 👍
@jstac a replica of this is available here: #375, and a preview is here: https://65782eb0f8c9513bb97425e7--nostalgic-wright-5fa355.netlify.app/svd_intro.html
Hi @mmcky, thanks for organizing. Thanks @matheusvillasb for putting this together and @HumphreyYang for the review. I'm happy with these changes but they should be approved by @thomassargent30 before being made live.
Thanks @matheusvillasb for these proposed changes. Thanks @HumphreyYang @jstac for your comments and reviews. @thomassargent30 has reviewed and will make a few minor changes once merged into the main branch.
Thanks once again @matheusvillasb -- these changes are going live today.
Corrected some typos.
Added a brief description explaining why the Eckart-Young theorem is important.
Added details on how to compute the PCA for any given matrix.
Changed the n_components term in the code to r_components, so it is in accord with the definition of the Eckart-Young theorem and doesn't create confusion with the n columns of the matrix.
Added an exercise and its solution at the end (I've never used Sphinx, so not sure if it's done correctly).