
Commit c684db8

Merge pull request #274 from UBC-DSCI/various-fixes
Various fixes
2 parents ba4b252 + f412b5e commit c684db8

10 files changed: +116 −80 lines

build_html.sh

+1 −1
@@ -1,2 +1,2 @@
 chmod -R o+w source/
-docker run --rm -v $(pwd):/home/jovyan ubcdsci/py-intro-to-ds:20231108192908c9b484 /bin/bash -c "jupyter-book build source"
+docker run --rm -v $(pwd):/home/jovyan ubcdsci/py-intro-to-ds:20231110054348fd23c8 /bin/bash -c "jupyter-book build source"

build_pdf.sh

+1 −1
@@ -1,2 +1,2 @@
 chmod -R o+w source/
-docker run --rm -v $(pwd):/home/jovyan ubcdsci/py-intro-to-ds:20231108192908c9b484 /bin/bash -c "export BOOK_BUILD_TYPE='PDF'; jupyter-book build source --builder pdflatex"
+docker run --rm -v $(pwd):/home/jovyan ubcdsci/py-intro-to-ds:20231110054348fd23c8 /bin/bash -c "export BOOK_BUILD_TYPE='PDF'; jupyter-book build source --builder pdflatex"

source/classification1.md

+35 −9
@@ -278,7 +278,7 @@ cancer["Class"].value_counts(normalize=True)
 ```
 
 Next, let's draw a colored scatter plot to visualize the relationship between the
-perimeter and concavity variables. Recall that `altair's` default palette
+perimeter and concavity variables. Recall that the default palette in `altair`
 is colorblind-friendly, so we can stick with that here.
 
 ```{code-cell} ipython3
@@ -332,7 +332,7 @@ points_df = pd.DataFrame(
 )
 perim_concav_with_new_point_df = pd.concat((cancer, points_df), ignore_index=True)
 # Find the euclidean distances from the new point to each of the points
-# in the orginal dataset
+# in the orginal data set
 my_distances = euclidean_distances(perim_concav_with_new_point_df[attrs])[
     len(cancer)
 ][:-1]
@@ -430,7 +430,7 @@ points_df2 = pd.DataFrame(
 )
 perim_concav_with_new_point_df2 = pd.concat((cancer, points_df2), ignore_index=True)
 # Find the euclidean distances from the new point to each of the points
-# in the orginal dataset
+# in the orginal data set
 my_distances2 = euclidean_distances(perim_concav_with_new_point_df2[attrs])[
     len(cancer)
 ][:-1]
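
The slicing pattern in the context lines above (`euclidean_distances(...)[len(cancer)][:-1]`) can be illustrated with a small standalone sketch; the coordinates below are hypothetical stand-ins for the perimeter and concavity columns, with the new point appended as the last row.

```python
import numpy as np
from sklearn.metrics.pairwise import euclidean_distances

# hypothetical (Perimeter, Concavity) rows: three original observations plus one new point appended last
points = np.array([
    [0.24, 2.65],
    [0.75, 2.87],
    [0.62, 2.54],
    [0.00, 3.50],  # the new observation
])

# euclidean_distances returns the full pairwise distance matrix; the row for the
# appended point holds its distances to every observation, and [:-1] drops the
# (zero) distance from the new point to itself
dists = euclidean_distances(points)[len(points) - 1][:-1]
print(dists.round(2))  # [0.88 0.98 1.14]
```
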
@@ -639,6 +639,32 @@ new_obs_Concavity = 3.5
 )
 ```
 
+```{code-cell} ipython3
+:tags: [remove-cell]
+# code needed to render the latex table with distance calculations
+from IPython.display import Latex
+five_neighbors = (
+    cancer
+    [["Perimeter", "Concavity", "Class"]]
+    .assign(dist_from_new = (
+        (cancer["Perimeter"] - new_obs_Perimeter) ** 2
+        + (cancer["Concavity"] - new_obs_Concavity) ** 2
+    )**(1/2))
+    .nsmallest(5, "dist_from_new")
+).reset_index()
+
+for i in range(5):
+    glue(f"gn{i}_perim", "{:0.2f}".format(five_neighbors["Perimeter"][i]))
+    glue(f"gn{i}_concav", "{:0.2f}".format(five_neighbors["Concavity"][i]))
+    glue(f"gn{i}_class", five_neighbors["Class"][i])
+
+    # typeset perimeter,concavity with parentheses if negative for latex
+    nperim = f"{five_neighbors['Perimeter'][i]:.2f}" if five_neighbors['Perimeter'][i] > 0 else f"({five_neighbors['Perimeter'][i]:.2f})"
+    nconcav = f"{five_neighbors['Concavity'][i]:.2f}" if five_neighbors['Concavity'][i] > 0 else f"({five_neighbors['Concavity'][i]:.2f})"
+
+    glue(f"gdisteqn{i}", Latex(f"\sqrt{{(0-{nperim})^2+(3.5-{nconcav})^2}}={five_neighbors['dist_from_new'][i]:.2f}"))
+```
+
 In {numref}`tab:05-multiknn-mathtable` we show in mathematical detail how
 we computed the `dist_from_new` variable (the
 distance to the new observation) for each of the 5 nearest neighbors in the
@@ -648,11 +674,11 @@ training data.
 :name: tab:05-multiknn-mathtable
 | Perimeter | Concavity | Distance | Class |
 |-----------|-----------|----------------------------------------|-------|
-| 0.24 | 2.65 | $\sqrt{(0-0.24)^2+(3.5-2.65)^2}=0.88$| Benign |
-| 0.75 | 2.87 | $\sqrt{(0-0.75)^2+(3.5-2.87)^2}=0.98$| Malignant |
-| 0.62 | 2.54 | $\sqrt{(0-0.62)^2+(3.5-2.54)^2}=1.14$| Malignant |
-| 0.42 | 2.31 | $\sqrt{(0-0.42)^2+(3.5-2.31)^2}=1.26$| Malignant |
-| -1.16 | 4.04 | $\sqrt{(0-(-1.16))^2+(3.5-4.04)^2}=1.28$| Benign |
+| {glue:text}`gn0_perim` | {glue:text}`gn0_concav` | {glue:}`gdisteqn0` | {glue:text}`gn0_class` |
+| {glue:text}`gn1_perim` | {glue:text}`gn1_concav` | {glue:}`gdisteqn1` | {glue:text}`gn1_class` |
+| {glue:text}`gn2_perim` | {glue:text}`gn2_concav` | {glue:}`gdisteqn2` | {glue:text}`gn2_class` |
+| {glue:text}`gn3_perim` | {glue:text}`gn3_concav` | {glue:}`gdisteqn3` | {glue:text}`gn3_class` |
+| {glue:text}`gn4_perim` | {glue:text}`gn4_concav` | {glue:}`gdisteqn4` | {glue:text}`gn4_class` |
 ```
 
 +++
@@ -757,7 +783,7 @@ points_df4 = pd.DataFrame(
 )
 perim_concav_with_new_point_df4 = pd.concat((cancer, points_df4), ignore_index=True)
 # Find the euclidean distances from the new point to each of the points
-# in the orginal dataset
+# in the orginal data set
 my_distances4 = euclidean_distances(perim_concav_with_new_point_df4[attrs])[
     len(cancer)
 ][:-1]

source/clustering.md

+10 −4
@@ -18,6 +18,10 @@ kernelspec:
 ```{code-cell} ipython3
 :tags: [remove-cell]
 
+# get rid of futurewarnings from sklearn kmeans
+import warnings
+warnings.simplefilter(action='ignore', category=FutureWarning)
+
 from chapter_preamble import *
 ```
 
@@ -391,8 +395,10 @@ that we learned about in {numref}`Chapter %s <classification1>`.
 In the {glue:text}`clus_rows_glue`-observation cluster example above,
 we would compute the WSSD $S^2$ via
 
-
-$S^2 = \left((x_1 - \mu_x)^2 + (y_1 - \mu_y)^2\right) + \left((x_2 - \mu_x)^2 + (y_2 - \mu_y)^2\right) + \left((x_3 - \mu_x)^2 + (y_3 - \mu_y)^2\right) + \left((x_4 - \mu_x)^2 + (y_4 - \mu_y)^2\right)$
+$$
+S^2 = \left((x_1 - \mu_x)^2 + (y_1 - \mu_y)^2\right) + \left((x_2 - \mu_x)^2 + (y_2 - \mu_y)^2\right)\\
++ \left((x_3 - \mu_x)^2 + (y_3 - \mu_y)^2\right) + \left((x_4 - \mu_x)^2 + (y_4 - \mu_y)^2\right)
+$$
 
 These distances are denoted by lines in {numref}`toy-example-clus1-dists` for the first cluster of the penguin data example.
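
A minimal numpy sketch of the WSSD computation written out in the equation above, using hypothetical coordinates for a four-observation cluster:

```python
import numpy as np

# hypothetical coordinates of a four-observation cluster
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 1.0, 4.0, 3.0])

# cluster center (mu_x, mu_y)
mu_x, mu_y = x.mean(), y.mean()

# within-cluster sum of squared distances, matching S^2 above
S2 = np.sum((x - mu_x) ** 2 + (y - mu_y) ** 2)
print(S2)  # 10.0
```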

@@ -786,7 +792,7 @@ Total WSSD for K clusters ranging from 1 to 9.
 We can perform K-means in Python using a workflow similar to those
 in the earlier classification and regression chapters. We will begin
 by reading the original (i.e., unstandardized) subset of 18 observations
-from the penguins dataset.
+from the penguins data set.
 
 ```{code-cell} ipython3
 :tags: [remove-cell]
@@ -1050,7 +1056,7 @@ and guidance that the worksheets provide will function as intended.
 clustering for when you expect there to be subgroups, and then subgroups within
 subgroups, etc., in your data. In the realm of more general unsupervised
 learning, it covers *principal components analysis (PCA)*, which is a very
-popular technique for reducing the number of predictors in a dataset.
+popular technique for reducing the number of predictors in a data set.
 
 ## References
 
source/inference.md

+4 −10
@@ -1197,12 +1197,6 @@ sample. Since the bootstrap distribution pretty well approximates the sampling
 distribution spread, we can use the bootstrap spread to help us develop a
 plausible range for our population parameter along with our estimate!
 
-```{code-cell} ipython3
-:tags: [remove-cell]
-
-!wget -O img/inference/11-bootstrapping7-1.png https://datasciencebook.ca/_main_files/figure-html/11-bootstrapping7-1.png
-```
-
 ```{figure} img/inference/11-bootstrapping7-1.png
 :name: fig:11-bootstrapping7
@@ -1244,7 +1238,7 @@ To calculate a 95\% percentile bootstrap confidence interval, we will do the fol
 To do this in Python, we can use the `quantile` function of our DataFrame.
 Quantiles are expressed in proportions rather than percentages,
 so the 2.5th and 97.5th percentiles
-would be quantiles 0.025 and 0.975, respectively.
+would be the 0.025 and 0.975 quantiles, respectively.
 
 ```{index} numpy; percentile, pandas.DataFrame; df[]
 ```
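
For reference, a minimal sketch of the `quantile` call described here; the DataFrame and its `mean_price` column are hypothetical stand-ins for the bootstrap distribution of sample means:

```python
import pandas as pd

# hypothetical bootstrap sample means
boot_means = pd.DataFrame({"mean_price": [120.5, 148.2, 131.9, 155.4, 140.3, 127.8]})

# the 2.5th and 97.5th percentiles, expressed as the 0.025 and 0.975 quantiles
ci_bounds = boot_means["mean_price"].quantile([0.025, 0.975])
print(ci_bounds[0.025], ci_bounds[0.975])
```
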
@@ -1257,8 +1251,8 @@ ci_bounds
 ```{code-cell} ipython3
 :tags: [remove-cell]
 
-glue("ci_lower", "{:.1f}".format(ci_bounds[0.025]))
-glue("ci_upper", "{:.1f}".format(ci_bounds[0.975]))
+glue("ci_lower", "{:.2f}".format(ci_bounds[0.025]))
+glue("ci_upper", "{:.2f}".format(ci_bounds[0.975]))
 ```
 
 Our interval, \${glue:text}`ci_lower` to \${glue:text}`ci_upper`, captures
@@ -1306,7 +1300,7 @@ estimate and our confidence interval's lower and upper bounds. Here the sample
 mean price-per-night of 40 Airbnb listings was
 \${glue:text}`one_sample_mean`, and we are 95\% "confident" that the true
 population mean price-per-night for all Airbnb listings in Vancouver is between
-\$({glue:text}`ci_lower`, {glue:text}`ci_upper`).
+\${glue:text}`ci_lower` and \${glue:text}`ci_upper`.
 Notice that our interval does indeed contain the true
 population mean value, \${glue:text}`population_mean`\! However, in
 practice, we would not know whether our interval captured the population

source/reading.md

+47 −37
@@ -80,22 +80,21 @@ functions, we first need to talk about *where* the data lives. When you load a
 data set into Python, you first need to tell Python where those files live. The file
 could live on your computer (*local*) or somewhere on the internet (*remote*).
 
-The place where the file lives on your computer is called the "path". You can
+The place where the file lives on your computer is referred to as its "path". You can
 think of the path as directions to the file. There are two kinds of paths:
-*relative* paths and *absolute* paths. A relative path is where the file is
-with respect to where you currently are on the computer (e.g., where the file
-you're working in is). On the other hand, an absolute path is where the file is
-in respect to the computer's filesystem base (or root) folder.
+*relative* paths and *absolute* paths. A relative path indicates where the file is
+with respect to your *working directory* (i.e., "where you are currently") on the computer.
+On the other hand, an absolute path indicates where the file is
+with respect to the computer's filesystem base (or *root*) folder, regardless of where you are working.
 
 ```{index} Happiness Report
 ```
 
 Suppose our computer's filesystem looks like the picture in
-{numref}`Filesystem`, and we are working in a
-file titled `worksheet_02.ipynb`. If we want to
-read the `.csv` file named `happiness_report.csv` into Python, we could do this
-using either a relative or an absolute path. We show both choices
-below.
+{numref}`Filesystem`. We are working in a
+file titled `worksheet_02.ipynb`, and our current working directory is `worksheet_02`;
+typically, as is the case here, the working directory is the directory containing the file you are currently
+working on.
 
 ```{figure} img/reading/filesystem.jpeg
 ---
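
A one-line sketch for checking the working directory described above from within Python, using only the standard library:

```python
import os

# print the current working directory: the starting point for any relative path
print(os.getcwd())
```
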
@@ -105,34 +104,42 @@ name: Filesystem
 Example file system
 ```
 
-
-**Reading `happiness_report.csv` using a relative path:**
-
-+++
-
+Let's say we wanted to open the `happiness_report.csv` file. We have two options to indicate
+where the file is: using a relative path, or using an absolute path.
+The absolute path of the file always starts with a slash `/`&mdash;representing the root folder on the computer&mdash;and
+proceeds by listing out the sequence of folders you would have to enter to reach the file, each separated by another slash `/`.
+So in this case, `happiness_report.csv` would be reached by starting at the root, and entering the `home` folder,
+then the `dsci-100` folder, then the `worksheet_02` folder, and then finally the `data` folder. So its absolute
+path would be `/home/dsci-100/worksheet_02/data/happiness_report.csv`. We can load the file using its absolute path
+as a string passed to the `read_csv` function from `pandas`.
 ```python
-happy_data = pd.read_csv("data/happiness_report.csv")
+happy_data = pd.read_csv("/home/dsci-100/worksheet_02/data/happiness_report.csv")
 ```
-
-+++
-
-**Reading `happiness_report.csv` using an absolute path:**
-
-+++
-
+If we instead wanted to use a relative path, we would need to list out the sequence of steps needed to get from our current
+working directory to the file, with slashes `/` separating each step. Since we are currently in the `worksheet_02` folder,
+we just need to enter the `data` folder to reach our desired file. Hence the relative path is `data/happiness_report.csv`,
+and we can load the file using its relative path as a string passed to `read_csv`.
 ```python
-happy_data = pd.read_csv("/home/dsci-100/worksheet_02/data/happiness_report.csv")
+happy_data = pd.read_csv("data/happiness_report.csv")
 ```
+Note that there is no forward slash at the beginning of a relative path; if we accidentally typed `"/data/happiness_report.csv"`,
+Python would look for a folder named `data` in the root folder of the computer&mdash;but that doesn't exist!
 
-+++
+Aside from specifying places to go in a path using folder names (like `data` and `worksheet_02`), we can also specify two additional
+special places: the *current directory* and the *previous directory*. We indicate the current working directory with a single dot `.`, and
+the previous directory with two dots `..`. So for instance, if we wanted to reach the `bike_share.csv` file from the `worksheet_02` folder, we could
+use the relative path `../tutorial_01/bike_share.csv`. We can even combine these two; for example, we could reach the `bike_share.csv` file using
+the (very silly) path `../tutorial_01/../tutorial_01/./bike_share.csv` with quite a few redundant directions: it says to go back a folder, then open `tutorial_01`,
+then go back a folder again, then open `tutorial_01` again, then stay in the current directory, then finally get to `bike_share.csv`. Whew, what a long trip!
 
-So which one should you use? Generally speaking, to ensure your code can be run
-on a different computer, you should use relative paths. An added bonus is that
-it's also less typing! Generally, you should use relative paths because the file's
-absolute path (the names of
-folders between the computer's root `/` and the file) isn't usually the same
-across different computers. For example, suppose Fatima and Jayden are working on a
-project together on the `happiness_report.csv` data. Fatima's file is stored at
+So which kind of path should you use: relative, or absolute? Generally speaking, you should use relative paths.
+Using a relative path helps ensure that your code can be run
+on a different computer (and as an added bonus, relative paths are often shorter&mdash;easier to type!).
+This is because a file's relative path is often the same across different computers, while a
+file's absolute path (the names of
+all of the folders between the computer's root, represented by `/`, and the file) isn't usually the same
+across different computers. For example, suppose Fatima and Jayden are working on a
+project together on the `happiness_report.csv` data. Fatima's file is stored at
 
 ```
 /home/Fatima/project/data/happiness_report.csv
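
One way to verify that the "very silly" path above and the plain one point at the same place is to normalize both with Python's `os.path.normpath`; a minimal sketch using the hypothetical folder layout from the text:

```python
import os.path

silly = "../tutorial_01/../tutorial_01/./bike_share.csv"
simple = "../tutorial_01/bike_share.csv"

# normpath collapses the redundant "." and ".." steps without touching the filesystem
print(os.path.normpath(silly))                              # ../tutorial_01/bike_share.csv (on a POSIX system)
print(os.path.normpath(silly) == os.path.normpath(simple))  # True
```
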
@@ -150,16 +157,19 @@ their different usernames. If Jayden has code that loads the
 `happiness_report.csv` data using an absolute path, the code won't work on
 Fatima's computer. But the relative path from inside the `project` folder
 (`data/happiness_report.csv`) is the same on both computers; any code that uses
-relative paths will work on both!
+relative paths will work on both! In the additional resources section,
+we include a link to a short video on the
+difference between absolute and relative paths.
 
 ```{index} URL
 ```
 
-Your file could be stored locally, as we discussed, or it could also be
-somewhere on the internet (remotely). For this purpose we use a
+Beyond files stored on your computer (i.e., locally), we also need a way to locate resources
+stored elsewhere on the internet (i.e., remotely). For this purpose we use a
 *Uniform Resource Locator (URL)*, i.e., a web address that looks something
-like https://google.com/. URLs indicate the location of a resource on the internet and
-helps us retrieve that resource.
+like https://datasciencebook.ca/. URLs indicate the location of a resource on the internet, and
+start with a web domain, followed by a forward slash `/`, and then a path
+to where the resource is located on the remote machine.
 
 ## Reading tabular data from a plain text file into Python
 
source/regression1.md

+11 −11
@@ -54,7 +54,7 @@ By the end of the chapter, readers will be able to do the following:
 * Recognize situations where a simple regression analysis would be appropriate for making predictions.
 * Explain the K-nearest neighbor (KNN) regression algorithm and describe how it differs from KNN classification.
 * Interpret the output of a KNN regression.
-* In a dataset with two or more variables, perform K-nearest neighbor regression in Python using a `scikit-learn` workflow.
+* In a data set with two or more variables, perform K-nearest neighbor regression in Python using a `scikit-learn` workflow.
 * Execute cross-validation in Python to choose the number of neighbors.
 * Evaluate KNN regression prediction accuracy in Python using a test data set and the root mean squared prediction error (RMSPE).
 * In the context of KNN regression, compare and contrast goodness of fit and prediction properties (namely RMSE vs RMSPE).
@@ -644,8 +644,8 @@ Alright, now the `mean_test_score` variable actually has values of the RMSPE
 for different numbers of neighbors. Finally, the `sem_test_score` variable
 contains the standard error of our cross-validation RMSPE estimate, which
 is a measure of how uncertain we are in the mean value. Roughly, if
-your estimated mean RMSPE is 100,000 and standard error is 1,000, you can expect the
-*true* RMSPE to be somewhere roughly between 99,000 and 101,000 (although it
+your estimated mean RMSPE is \$100,000 and standard error is \$1,000, you can expect the
+*true* RMSPE to be somewhere roughly between \$99,000 and \$101,000 (although it
 may fall outside this range).
 
 {numref}`fig:07-choose-k-knn-plot` visualizes how the RMSPE varies with the number of neighbors $K$.
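
A minimal sketch of the mean-plus-or-minus-one-standard-error arithmetic described here, with hypothetical cross-validation RMSPE values:

```python
import numpy as np

# hypothetical RMSPE estimates from five cross-validation folds
rmspe_folds = np.array([98_000, 101_500, 99_800, 102_300, 98_400])

mean_rmspe = rmspe_folds.mean()
sem_rmspe = rmspe_folds.std(ddof=1) / np.sqrt(len(rmspe_folds))

# rough range for the true RMSPE: mean +/- one standard error
print(mean_rmspe - sem_rmspe, mean_rmspe + sem_rmspe)
```
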
@@ -795,8 +795,8 @@ In this case the orange line becomes extremely smooth, and actually becomes flat
 once $K$ is equal to the number of datapoints in the entire data set.
 This happens because our predicted values for a given x value (here, home
 size), depend on many neighboring observations; in the case where $K$ is equal
-to the size of the dataset, the prediction is just the mean of the house prices
-in the dataset (completely ignoring the house size).
+to the size of the data set, the prediction is just the mean of the house prices
+in the data set (completely ignoring the house size).
 In contrast to the $K=1$ example,
 the smooth, inflexible orange line does not follow the training observations very closely.
 In other words, the model is *not influenced enough* by the training data.
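
A minimal `scikit-learn` sketch of the claim in this hunk: when the number of neighbors equals the number of observations, every prediction collapses to the mean of the response (toy numbers, not the housing data):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# toy data: five observations of house size (X) and price (y)
X = np.array([[900], [1100], [1500], [2000], [2600]])
y = np.array([300_000, 350_000, 420_000, 500_000, 610_000])

# K equal to the number of observations in the data set
knn = KNeighborsRegressor(n_neighbors=len(X)).fit(X, y)

# every prediction is just the mean price, regardless of house size
print(knn.predict([[1000], [2500]]))  # both equal y.mean() = 436000.0
```
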
@@ -1057,11 +1057,11 @@ Here we see that the smallest estimated RMSPE from cross-validation occurs when
 If we want to compare this multivariable KNN regression model to the model with only a single
 predictor *as part of the model tuning process* (e.g., if we are running forward selection as described
 in the chapter on evaluating and tuning classification models),
-then we must compare the accuracy estimated using only the training data via cross-validation.
-Looking back, the estimated cross-validation accuracy for the single-predictor
-model was {glue:text}`cv_RMSPE`.
-The estimated cross-validation accuracy for the multivariable model is
-{glue:text}`cv_RMSPE_2pred`.
+then we must compare the RMSPE estimated using only the training data via cross-validation.
+Looking back, the estimated cross-validation RMSPE for the single-predictor
+model was \${glue:text}`cv_RMSPE`.
+The estimated cross-validation RMSPE for the multivariable model is
+\${glue:text}`cv_RMSPE_2pred`.
 Thus in this case, we did not improve the model
 by a large amount by adding this additional predictor.
 
@@ -1090,7 +1090,7 @@ glue("RMSPE_mult", "{0:,.0f}".format(RMSPE_mult))
 
 This time, when we performed KNN regression on the same data set, but also
 included number of bedrooms as a predictor, we obtained a RMSPE test error
-of {glue:text}`RMSPE_mult`.
+of \${glue:text}`RMSPE_mult`.
 {numref}`fig:07-knn-mult-viz` visualizes the model's predictions overlaid on top of the data. This
 time the predictions are a surface in 3D space, instead of a line in 2D space, as we have 2
 predictors instead of 1.
