
Commit c684db8

Merge pull request #274 from UBC-DSCI/various-fixes
Various fixes
2 parents ba4b252 + f412b5e commit c684db8

10 files changed: +116 −80 lines

build_html.sh

+1 −1
@@ -1,2 +1,2 @@
 chmod -R o+w source/
-docker run --rm -v $(pwd):/home/jovyan ubcdsci/py-intro-to-ds:20231108192908c9b484 /bin/bash -c "jupyter-book build source"
+docker run --rm -v $(pwd):/home/jovyan ubcdsci/py-intro-to-ds:20231110054348fd23c8 /bin/bash -c "jupyter-book build source"

build_pdf.sh

+1 −1
@@ -1,2 +1,2 @@
 chmod -R o+w source/
-docker run --rm -v $(pwd):/home/jovyan ubcdsci/py-intro-to-ds:20231108192908c9b484 /bin/bash -c "export BOOK_BUILD_TYPE='PDF'; jupyter-book build source --builder pdflatex"
+docker run --rm -v $(pwd):/home/jovyan ubcdsci/py-intro-to-ds:20231110054348fd23c8 /bin/bash -c "export BOOK_BUILD_TYPE='PDF'; jupyter-book build source --builder pdflatex"

source/classification1.md

+35 −9
@@ -278,7 +278,7 @@ cancer["Class"].value_counts(normalize=True)
 ```
 
 Next, let's draw a colored scatter plot to visualize the relationship between the
-perimeter and concavity variables. Recall that `altair's` default palette
+perimeter and concavity variables. Recall that the default palette in `altair`
 is colorblind-friendly, so we can stick with that here.
 
 ```{code-cell} ipython3
@@ -332,7 +332,7 @@ points_df = pd.DataFrame(
 )
 perim_concav_with_new_point_df = pd.concat((cancer, points_df), ignore_index=True)
 # Find the euclidean distances from the new point to each of the points
-# in the orginal dataset
+# in the orginal data set
 my_distances = euclidean_distances(perim_concav_with_new_point_df[attrs])[
     len(cancer)
 ][:-1]
@@ -430,7 +430,7 @@ points_df2 = pd.DataFrame(
 )
 perim_concav_with_new_point_df2 = pd.concat((cancer, points_df2), ignore_index=True)
 # Find the euclidean distances from the new point to each of the points
-# in the orginal dataset
+# in the orginal data set
 my_distances2 = euclidean_distances(perim_concav_with_new_point_df2[attrs])[
     len(cancer)
 ][:-1]
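
The slicing pattern in the context lines above (`euclidean_distances(...)[len(cancer)][:-1]`) can be illustrated with a small standalone sketch; the coordinates below are hypothetical stand-ins for the perimeter and concavity columns, with the new point appended as the last row.

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

# hypothetical (Perimeter, Concavity) rows: three original observations plus one new point appended last
points = np.array([
    [0.24, 2.65],
    [0.75, 2.87],
    [0.62, 2.54],
    [0.00, 3.50],  # the new observation
])

# euclidean_distances returns the full pairwise distance matrix; the row for the
# appended point holds its distances to every observation, and [:-1] drops the
# (zero) distance from the new point to itself
dists = euclidean_distances(points)[len(points) - 1][:-1]
print(dists.round(2))  # [0.88 0.98 1.14]
```
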
@@ -639,6 +639,32 @@ new_obs_Concavity = 3.5
 )
 ```
 
+```{code-cell} ipython3
+:tags: [remove-cell]
+# code needed to render the latex table with distance calculations
+from IPython.display import Latex
+five_neighbors = (
+    cancer
+    [["Perimeter", "Concavity", "Class"]]
+    .assign(dist_from_new = (
+        (cancer["Perimeter"] - new_obs_Perimeter) ** 2
+        + (cancer["Concavity"] - new_obs_Concavity) ** 2
+    )**(1/2))
+    .nsmallest(5, "dist_from_new")
+).reset_index()
+
+for i in range(5):
+    glue(f"gn{i}_perim", "{:0.2f}".format(five_neighbors["Perimeter"][i]))
+    glue(f"gn{i}_concav", "{:0.2f}".format(five_neighbors["Concavity"][i]))
+    glue(f"gn{i}_class", five_neighbors["Class"][i])
+
+    # typeset perimeter,concavity with parentheses if negative for latex
+    nperim = f"{five_neighbors['Perimeter'][i]:.2f}" if five_neighbors['Perimeter'][i] > 0 else f"({five_neighbors['Perimeter'][i]:.2f})"
+    nconcav = f"{five_neighbors['Concavity'][i]:.2f}" if five_neighbors['Concavity'][i] > 0 else f"({five_neighbors['Concavity'][i]:.2f})"
+
+    glue(f"gdisteqn{i}", Latex(f"\sqrt{{(0-{nperim})^2+(3.5-{nconcav})^2}}={five_neighbors['dist_from_new'][i]:.2f}"))
+```
+
 In {numref}`tab:05-multiknn-mathtable` we show in mathematical detail how
 we computed the `dist_from_new` variable (the
 distance to the new observation) for each of the 5 nearest neighbors in the
@@ -648,11 +674,11 @@ training data.
 :name: tab:05-multiknn-mathtable
 | Perimeter | Concavity | Distance | Class |
 |-----------|-----------|----------------------------------------|-------|
-| 0.24 | 2.65 | $\sqrt{(0-0.24)^2+(3.5-2.65)^2}=0.88$| Benign |
-| 0.75 | 2.87 | $\sqrt{(0-0.75)^2+(3.5-2.87)^2}=0.98$| Malignant |
-| 0.62 | 2.54 | $\sqrt{(0-0.62)^2+(3.5-2.54)^2}=1.14$| Malignant |
-| 0.42 | 2.31 | $\sqrt{(0-0.42)^2+(3.5-2.31)^2}=1.26$| Malignant |
-| -1.16 | 4.04 | $\sqrt{(0-(-1.16))^2+(3.5-4.04)^2}=1.28$| Benign |
+| {glue:text}`gn0_perim` | {glue:text}`gn0_concav` | {glue:}`gdisteqn0` | {glue:text}`gn0_class` |
+| {glue:text}`gn1_perim` | {glue:text}`gn1_concav` | {glue:}`gdisteqn1` | {glue:text}`gn1_class` |
+| {glue:text}`gn2_perim` | {glue:text}`gn2_concav` | {glue:}`gdisteqn2` | {glue:text}`gn2_class` |
+| {glue:text}`gn3_perim` | {glue:text}`gn3_concav` | {glue:}`gdisteqn3` | {glue:text}`gn3_class` |
+| {glue:text}`gn4_perim` | {glue:text}`gn4_concav` | {glue:}`gdisteqn4` | {glue:text}`gn4_class` |
 ```
 
 +++
@@ -757,7 +783,7 @@ points_df4 = pd.DataFrame(
 )
 perim_concav_with_new_point_df4 = pd.concat((cancer, points_df4), ignore_index=True)
 # Find the euclidean distances from the new point to each of the points
-# in the orginal dataset
+# in the orginal data set
 my_distances4 = euclidean_distances(perim_concav_with_new_point_df4[attrs])[
     len(cancer)
 ][:-1]

source/clustering.md

+10 −4
@@ -18,6 +18,10 @@ kernelspec:
 ```{code-cell} ipython3
 :tags: [remove-cell]
 
+# get rid of futurewarnings from sklearn kmeans
+import warnings
+warnings.simplefilter(action='ignore', category=FutureWarning)
+
 from chapter_preamble import *
 ```
 
@@ -391,8 +395,10 @@ that we learned about in {numref}`Chapter %s <classification1>`.
 In the {glue:text}`clus_rows_glue`-observation cluster example above,
 we would compute the WSSD $S^2$ via
 
-
-$S^2 = \left((x_1 - \mu_x)^2 + (y_1 - \mu_y)^2\right) + \left((x_2 - \mu_x)^2 + (y_2 - \mu_y)^2\right) + \left((x_3 - \mu_x)^2 + (y_3 - \mu_y)^2\right) + \left((x_4 - \mu_x)^2 + (y_4 - \mu_y)^2\right)$
+$$
+S^2 = \left((x_1 - \mu_x)^2 + (y_1 - \mu_y)^2\right) + \left((x_2 - \mu_x)^2 + (y_2 - \mu_y)^2\right)\\
++ \left((x_3 - \mu_x)^2 + (y_3 - \mu_y)^2\right) + \left((x_4 - \mu_x)^2 + (y_4 - \mu_y)^2\right)
+$$
 
 These distances are denoted by lines in {numref}`toy-example-clus1-dists` for the first cluster of the penguin data example.
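
A minimal numpy sketch of the WSSD computation written out in the equation above, using hypothetical coordinates for a four-observation cluster:

```python
import numpy as np

# hypothetical coordinates of a four-observation cluster
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 1.0, 4.0, 3.0])

# cluster center (mu_x, mu_y)
mu_x, mu_y = x.mean(), y.mean()

# within-cluster sum of squared distances, matching S^2 above
S2 = np.sum((x - mu_x) ** 2 + (y - mu_y) ** 2)
print(S2)  # 10.0
```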

@@ -786,7 +792,7 @@ Total WSSD for K clusters ranging from 1 to 9.
 We can perform K-means in Python using a workflow similar to those
 in the earlier classification and regression chapters. We will begin
 by reading the original (i.e., unstandardized) subset of 18 observations
-from the penguins dataset.
+from the penguins data set.
 
 ```{code-cell} ipython3
 :tags: [remove-cell]
@@ -1050,7 +1056,7 @@ and guidance that the worksheets provide will function as intended.
 clustering for when you expect there to be subgroups, and then subgroups within
 subgroups, etc., in your data. In the realm of more general unsupervised
 learning, it covers *principal components analysis (PCA)*, which is a very
-popular technique for reducing the number of predictors in a dataset.
+popular technique for reducing the number of predictors in a data set.
 
 ## References
 
source/inference.md

+4 −10
@@ -1197,12 +1197,6 @@ sample. Since the bootstrap distribution pretty well approximates the sampling
 distribution spread, we can use the bootstrap spread to help us develop a
 plausible range for our population parameter along with our estimate!
 
-```{code-cell} ipython3
-:tags: [remove-cell]
-
-!wget -O img/inference/11-bootstrapping7-1.png https://datasciencebook.ca/_main_files/figure-html/11-bootstrapping7-1.png
-```
-
 ```{figure} img/inference/11-bootstrapping7-1.png
 :name: fig:11-bootstrapping7
@@ -1244,7 +1238,7 @@ To calculate a 95\% percentile bootstrap confidence interval, we will do the fol
 To do this in Python, we can use the `quantile` function of our DataFrame.
 Quantiles are expressed in proportions rather than percentages,
 so the 2.5th and 97.5th percentiles
-would be quantiles 0.025 and 0.975, respectively.
+would be the 0.025 and 0.975 quantiles, respectively.
 
 ```{index} numpy; percentile, pandas.DataFrame; df[]
 ```
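
For reference, a minimal sketch of the `quantile` call described here; the DataFrame and its `mean_price` column are hypothetical stand-ins for the bootstrap distribution of sample means:

```python
import pandas as pd

# hypothetical bootstrap sample means
boot_means = pd.DataFrame({"mean_price": [120.5, 148.2, 131.9, 155.4, 140.3, 127.8]})

# the 2.5th and 97.5th percentiles, expressed as the 0.025 and 0.975 quantiles
ci_bounds = boot_means["mean_price"].quantile([0.025, 0.975])
print(ci_bounds[0.025], ci_bounds[0.975])
```
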
@@ -1257,8 +1251,8 @@ ci_bounds
 ```{code-cell} ipython3
 :tags: [remove-cell]
 
-glue("ci_lower", "{:.1f}".format(ci_bounds[0.025]))
-glue("ci_upper", "{:.1f}".format(ci_bounds[0.975]))
+glue("ci_lower", "{:.2f}".format(ci_bounds[0.025]))
+glue("ci_upper", "{:.2f}".format(ci_bounds[0.975]))
 ```
 
 Our interval, \${glue:text}`ci_lower` to \${glue:text}`ci_upper`, captures
@@ -1306,7 +1300,7 @@ estimate and our confidence interval's lower and upper bounds. Here the sample
 mean price-per-night of 40 Airbnb listings was
 \${glue:text}`one_sample_mean`, and we are 95\% "confident" that the true
 population mean price-per-night for all Airbnb listings in Vancouver is between
-\$({glue:text}`ci_lower`, {glue:text}`ci_upper`).
+\${glue:text}`ci_lower` and \${glue:text}`ci_upper`.
 Notice that our interval does indeed contain the true
 population mean value, \${glue:text}`population_mean`\! However, in
 practice, we would not know whether our interval captured the population

source/reading.md

+47 −37
@@ -80,22 +80,21 @@ functions, we first need to talk about *where* the data lives. When you load a
 data set into Python, you first need to tell Python where those files live. The file
 could live on your computer (*local*) or somewhere on the internet (*remote*).
 
-The place where the file lives on your computer is called the "path". You can
+The place where the file lives on your computer is referred to as its "path". You can
 think of the path as directions to the file. There are two kinds of paths:
-*relative* paths and *absolute* paths. A relative path is where the file is
-with respect to where you currently are on the computer (e.g., where the file
-you're working in is). On the other hand, an absolute path is where the file is
-in respect to the computer's filesystem base (or root) folder.
+*relative* paths and *absolute* paths. A relative path indicates where the file is
+with respect to your *working directory* (i.e., "where you are currently") on the computer.
+On the other hand, an absolute path indicates where the file is
+with respect to the computer's filesystem base (or *root*) folder, regardless of where you are working.
 
 ```{index} Happiness Report
 ```
 
 Suppose our computer's filesystem looks like the picture in
-{numref}`Filesystem`, and we are working in a
-file titled `worksheet_02.ipynb`. If we want to
-read the `.csv` file named `happiness_report.csv` into Python, we could do this
-using either a relative or an absolute path. We show both choices
-below.
+{numref}`Filesystem`. We are working in a
+file titled `worksheet_02.ipynb`, and our current working directory is `worksheet_02`;
+typically, as is the case here, the working directory is the directory containing the file you are currently
+working on.
 
 ```{figure} img/reading/filesystem.jpeg
 ---
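
A one-line sketch for checking the working directory described above from within Python, using only the standard library:

```python
import os

# print the current working directory: the starting point for any relative path
print(os.getcwd())
```
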
@@ -105,34 +104,42 @@ name: Filesystem
 Example file system
 ```
 
-
-**Reading `happiness_report.csv` using a relative path:**
-
-+++
-
+Let's say we wanted to open the `happiness_report.csv` file. We have two options to indicate
+where the file is: using a relative path, or using an absolute path.
+The absolute path of the file always starts with a slash `/`&mdash;representing the root folder on the computer&mdash;and
+proceeds by listing out the sequence of folders you would have to enter to reach the file, each separated by another slash `/`.
+So in this case, `happiness_report.csv` would be reached by starting at the root, and entering the `home` folder,
+then the `dsci-100` folder, then the `worksheet_02` folder, and then finally the `data` folder. So its absolute
+path would be `/home/dsci-100/worksheet_02/data/happiness_report.csv`. We can load the file using its absolute path
+as a string passed to the `read_csv` function from `pandas`.
 ```python
-happy_data = pd.read_csv("data/happiness_report.csv")
+happy_data = pd.read_csv("/home/dsci-100/worksheet_02/data/happiness_report.csv")
 ```
-
-+++
-
-**Reading `happiness_report.csv` using an absolute path:**
-
-+++
-
+If we instead wanted to use a relative path, we would need to list out the sequence of steps needed to get from our current
+working directory to the file, with slashes `/` separating each step. Since we are currently in the `worksheet_02` folder,
+we just need to enter the `data` folder to reach our desired file. Hence the relative path is `data/happiness_report.csv`,
+and we can load the file using its relative path as a string passed to `read_csv`.
 ```python
-happy_data = pd.read_csv("/home/dsci-100/worksheet_02/data/happiness_report.csv")
+happy_data = pd.read_csv("data/happiness_report.csv")
 ```
+Note that there is no forward slash at the beginning of a relative path; if we accidentally typed `"/data/happiness_report.csv"`,
+Python would look for a folder named `data` in the root folder of the computer&mdash;but that doesn't exist!
 
-+++
+Aside from specifying places to go in a path using folder names (like `data` and `worksheet_02`), we can also specify two additional
+special places: the *current directory* and the *previous directory*. We indicate the current working directory with a single dot `.`, and
+the previous directory with two dots `..`. So for instance, if we wanted to reach the `bike_share.csv` file from the `worksheet_02` folder, we could
+use the relative path `../tutorial_01/bike_share.csv`. We can even combine these two; for example, we could reach the `bike_share.csv` file using
+the (very silly) path `../tutorial_01/../tutorial_01/./bike_share.csv` with quite a few redundant directions: it says to go back a folder, then open `tutorial_01`,
+then go back a folder again, then open `tutorial_01` again, then stay in the current directory, then finally get to `bike_share.csv`. Whew, what a long trip!
 
-So which one should you use? Generally speaking, to ensure your code can be run
-on a different computer, you should use relative paths. An added bonus is that
-it's also less typing! Generally, you should use relative paths because the file's
-absolute path (the names of
-folders between the computer's root `/` and the file) isn't usually the same
-across different computers. For example, suppose Fatima and Jayden are working on a
-project together on the `happiness_report.csv` data. Fatima's file is stored at
+So which kind of path should you use: relative, or absolute? Generally speaking, you should use relative paths.
+Using a relative path helps ensure that your code can be run
+on a different computer (and as an added bonus, relative paths are often shorter&mdash;easier to type!).
+This is because a file's relative path is often the same across different computers, while a
+file's absolute path (the names of
+all of the folders between the computer's root, represented by `/`, and the file) isn't usually the same
+across different computers. For example, suppose Fatima and Jayden are working on a
+project together on the `happiness_report.csv` data. Fatima's file is stored at
 
 ```
 /home/Fatima/project/data/happiness_report.csv
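
One way to verify that the "very silly" path above and the plain one point at the same place is to normalize both with Python's `os.path.normpath`; a minimal sketch using the hypothetical folder layout from the text:

```python
import os.path

silly = "../tutorial_01/../tutorial_01/./bike_share.csv"
simple = "../tutorial_01/bike_share.csv"

# normpath collapses the redundant "." and ".." steps without touching the filesystem
print(os.path.normpath(silly))                              # ../tutorial_01/bike_share.csv (on a POSIX system)
print(os.path.normpath(silly) == os.path.normpath(simple))  # True
```
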
@@ -150,16 +157,19 @@ their different usernames. If Jayden has code that loads the
 `happiness_report.csv` data using an absolute path, the code won't work on
 Fatima's computer. But the relative path from inside the `project` folder
 (`data/happiness_report.csv`) is the same on both computers; any code that uses
-relative paths will work on both!
+relative paths will work on both! In the additional resources section,
+we include a link to a short video on the
+difference between absolute and relative paths.
 
 ```{index} URL
 ```
 
-Your file could be stored locally, as we discussed, or it could also be
-somewhere on the internet (remotely). For this purpose we use a
+Beyond files stored on your computer (i.e., locally), we also need a way to locate resources
+stored elsewhere on the internet (i.e., remotely). For this purpose we use a
 *Uniform Resource Locator (URL)*, i.e., a web address that looks something
-like https://google.com/. URLs indicate the location of a resource on the internet and
-helps us retrieve that resource.
+like https://datasciencebook.ca/. URLs indicate the location of a resource on the internet, and
+start with a web domain, followed by a forward slash `/`, and then a path
+to where the resource is located on the remote machine.
 
 ## Reading tabular data from a plain text file into Python
 
source/regression1.md

+11 −11
@@ -54,7 +54,7 @@ By the end of the chapter, readers will be able to do the following:
 * Recognize situations where a simple regression analysis would be appropriate for making predictions.
 * Explain the K-nearest neighbor (KNN) regression algorithm and describe how it differs from KNN classification.
 * Interpret the output of a KNN regression.
-* In a dataset with two or more variables, perform K-nearest neighbor regression in Python using a `scikit-learn` workflow.
+* In a data set with two or more variables, perform K-nearest neighbor regression in Python using a `scikit-learn` workflow.
 * Execute cross-validation in Python to choose the number of neighbors.
 * Evaluate KNN regression prediction accuracy in Python using a test data set and the root mean squared prediction error (RMSPE).
 * In the context of KNN regression, compare and contrast goodness of fit and prediction properties (namely RMSE vs RMSPE).
@@ -644,8 +644,8 @@ Alright, now the `mean_test_score` variable actually has values of the RMSPE
 for different numbers of neighbors. Finally, the `sem_test_score` variable
 contains the standard error of our cross-validation RMSPE estimate, which
 is a measure of how uncertain we are in the mean value. Roughly, if
-your estimated mean RMSPE is 100,000 and standard error is 1,000, you can expect the
-*true* RMSPE to be somewhere roughly between 99,000 and 101,000 (although it
+your estimated mean RMSPE is \$100,000 and standard error is \$1,000, you can expect the
+*true* RMSPE to be somewhere roughly between \$99,000 and \$101,000 (although it
 may fall outside this range).
 
 {numref}`fig:07-choose-k-knn-plot` visualizes how the RMSPE varies with the number of neighbors $K$.
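
A minimal sketch of the mean-plus-or-minus-one-standard-error arithmetic described here, with hypothetical cross-validation RMSPE values:

```python
import numpy as np

# hypothetical RMSPE estimates from five cross-validation folds
rmspe_folds = np.array([98_000, 101_500, 99_800, 102_300, 98_400])

mean_rmspe = rmspe_folds.mean()
sem_rmspe = rmspe_folds.std(ddof=1) / np.sqrt(len(rmspe_folds))

# rough range for the true RMSPE: mean +/- one standard error
print(mean_rmspe - sem_rmspe, mean_rmspe + sem_rmspe)
```
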
@@ -795,8 +795,8 @@ In this case the orange line becomes extremely smooth, and actually becomes flat
 once $K$ is equal to the number of datapoints in the entire data set.
 This happens because our predicted values for a given x value (here, home
 size), depend on many neighboring observations; in the case where $K$ is equal
-to the size of the dataset, the prediction is just the mean of the house prices
-in the dataset (completely ignoring the house size).
+to the size of the data set, the prediction is just the mean of the house prices
+in the data set (completely ignoring the house size).
 In contrast to the $K=1$ example,
 the smooth, inflexible orange line does not follow the training observations very closely.
 In other words, the model is *not influenced enough* by the training data.
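
A minimal `scikit-learn` sketch of the claim in this hunk: when the number of neighbors equals the number of observations, every prediction collapses to the mean of the response (toy numbers, not the housing data):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# toy data: five observations of house size (X) and price (y)
X = np.array([[900], [1100], [1500], [2000], [2600]])
y = np.array([300_000, 350_000, 420_000, 500_000, 610_000])

# K equal to the number of observations in the data set
knn = KNeighborsRegressor(n_neighbors=len(X)).fit(X, y)

# every prediction is just the mean price, regardless of house size
print(knn.predict([[1000], [2500]]))  # both equal y.mean() = 436000.0
```
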
@@ -1057,11 +1057,11 @@ Here we see that the smallest estimated RMSPE from cross-validation occurs when
 If we want to compare this multivariable KNN regression model to the model with only a single
 predictor *as part of the model tuning process* (e.g., if we are running forward selection as described
 in the chapter on evaluating and tuning classification models),
-then we must compare the accuracy estimated using only the training data via cross-validation.
-Looking back, the estimated cross-validation accuracy for the single-predictor
-model was {glue:text}`cv_RMSPE`.
-The estimated cross-validation accuracy for the multivariable model is
-{glue:text}`cv_RMSPE_2pred`.
+then we must compare the RMSPE estimated using only the training data via cross-validation.
+Looking back, the estimated cross-validation RMSPE for the single-predictor
+model was \${glue:text}`cv_RMSPE`.
+The estimated cross-validation RMSPE for the multivariable model is
+\${glue:text}`cv_RMSPE_2pred`.
 Thus in this case, we did not improve the model
 by a large amount by adding this additional predictor.
 
@@ -1090,7 +1090,7 @@ glue("RMSPE_mult", "{0:,.0f}".format(RMSPE_mult))
 
 This time, when we performed KNN regression on the same data set, but also
 included number of bedrooms as a predictor, we obtained a RMSPE test error
-of {glue:text}`RMSPE_mult`.
+of \${glue:text}`RMSPE_mult`.
 {numref}`fig:07-knn-mult-viz` visualizes the model's predictions overlaid on top of the data. This
 time the predictions are a surface in 3D space, instead of a line in 2D space, as we have 2
 predictors instead of 1.
