Commit ef26fbf

Merge pull request #317 from UBC-DSCI/graphic-design
Graphic design updates
2 parents 7c9a46a + cf7849b commit ef26fbf

67 files changed: +8615 -122 lines changed

build_pdf.sh

+13
@@ -1,2 +1,15 @@
+# Script to generate PDF book
+
+# backup original index.Rmd
+cp source/index.md index_backup.md
+
+# PDF book doesn't need the welcome page. I couldn't find a way to stop jupyterbook from including it.
+# so this script manually removes the welcome page before building the PDF. This is a bit painful, but it works...
+sed -n -i "/graphic/q;p" source/index.md
+echo "# Data Science: A First Introduction" >> source/index.md
+
 chmod -R o+w source/
 docker run --rm -v $(pwd):/home/jovyan ubcdsci/py-intro-to-ds:20231112004031dd2207 /bin/bash -c "export BOOK_BUILD_TYPE='PDF'; jupyter-book build source --builder pdflatex"
+
+# restore the backed up full index.Rmd
+mv index_backup.md source/index.md
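
Reading the `sed` expression in the hunk above: with `-n`, `/graphic/q;p` prints each line until the first line containing `graphic`, then quits without printing it, so (with GNU sed's `-i`) `source/index.md` is cut off at that line before the title is appended. The Python sketch below mirrors that truncate-then-append step for illustration only. The function name `strip_welcome_page` and the sample text are hypothetical and are not part of the commit.

```python
def strip_welcome_page(text, marker="graphic",
                       title="# Data Science: A First Introduction"):
    """Keep lines before the first one containing `marker`, then append `title`.

    Hypothetical helper mirroring `sed -n "/graphic/q;p"` followed by
    `echo "<title>" >> file`; it is not code from the commit.
    """
    kept = []
    for line in text.splitlines():
        if marker in line:  # sed: /graphic/q  (quit without printing)
            break
        kept.append(line)   # sed: p
    kept.append(title)      # echo "..." >> source/index.md
    return "\n".join(kept) + "\n"


# Made-up stand-in for source/index.md, for illustration only.
sample = "---\nkernelspec: python3\n---\n![](img/graphic.png)\n# Welcome!\n"
print(strip_welcome_page(sample))
```

If the marker never appears, every line is kept and the title is still appended, which matches what the two shell commands would do.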

source/_config.yml

+1 -1
@@ -1,6 +1,6 @@
 # Book settings
 title: "Data Science: A First Introduction (Python Edition)"
-author: "The DSCI100 Development Team"
+author: "Tiffany Timbers, Trevor Campbell, Melissa Lee, Joel Ostblom, and Lindsey Heagy"
 copyright: "2022" # Copyright year to be placed in the footer
 logo: "" # A path to the book logo
 # Patterns to skip when building the book. Can be glob-style (e.g. "*skip.ipynb")

source/_toc.yml

+1 -1
@@ -4,7 +4,7 @@ parts:
   - caption: Front Matter
     chapters:
     - file: preface-text.md
-    #- file: foreword.md
+    - file: foreword-text.md
     - file: acknowledgements.md
     - file: authors.md
   - caption: Chapters

source/classification2.md

+7 -7
@@ -89,7 +89,7 @@ when predicting whether a patient's tumor is malignant or benign!
 
 +++
 
-```{figure} img/classification2/training_test.jpeg
+```{figure} img/classification2/training_test.png
 :name: fig:06-training-test
 
 Splitting the data into training and testing sets.
@@ -1301,7 +1301,7 @@ Here we are using the shortcut `point=True` to layer a point and line chart.
 accuracy_vs_k = alt.Chart(accuracies_grid).mark_line(point=True).encode(
     x=alt.X("n_neighbors").title("Neighbors"),
     y=alt.Y("mean_test_score")
-        .scale(domain=(0.85, 0.90))
+        .scale(zero=False)
         .title("Accuracy estimate")
 )
 
@@ -1388,7 +1388,7 @@ large_accuracies_grid = pd.DataFrame(large_cancer_tune_grid.cv_results_)
 large_accuracy_vs_k = alt.Chart(large_accuracies_grid).mark_line(point=True).encode(
     x=alt.X("param_kneighborsclassifier__n_neighbors").title("Neighbors"),
     y=alt.Y("mean_test_score")
-        .scale(domain=(0.60, 0.90))
+        .scale(zero=False)
         .title("Accuracy estimate")
 )
 
@@ -1664,7 +1664,7 @@ estimate its accuracy. The overall process is summarized in
 
 +++
 
-```{figure} img/classification2/train-test-overview.jpeg
+```{figure} img/classification2/train-test-overview.png
 :name: fig:06-overview
 
 Overview of K-NN classification.
@@ -1836,7 +1836,7 @@ plt_irrelevant_accuracies = (
         y=alt.Y(
             "accs",
             title="Model Accuracy Estimate",
-            scale=alt.Scale(domain=(0.80, 0.95)),
+            scale=alt.Scale(zero=False),
         ),
     )
 )
@@ -1899,7 +1899,7 @@ plt_irrelevant_nghbrs_fixed = (
         x=alt.X("ks", title="Number of Irrelevant Predictors"),
         y=alt.Y(
             "Accuracy",
-            scale=alt.Scale(domain=(0.75, 0.95)),
+            scale=alt.Scale(zero=False),
         ),
         color=alt.Color("Type"),
     )
@@ -2140,7 +2140,7 @@ fwd_sel_accuracies_plot = (
         y=alt.Y(
             "accuracy",
             title="Estimated Accuracy",
-            scale=alt.Scale(domain=(0.89, 0.935)),
+            scale=alt.Scale(zero=False),
         ),
     )
 )
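
The hunks above swap hard-coded y-axis domains for `zero=False`, which in Altair only drops the requirement that the axis include zero and otherwise leaves the domain to be derived from the data. As a rough illustration of the difference, here is a plain-Python sketch (no Altair). The function names and accuracy values are made up, and real Vega-Lite scales also round the domain to nice tick values, so this is a simplified stand-in rather than the library's behavior.

```python
def outside_fixed_domain(values, domain):
    """Return the values that fall outside a hard-coded (low, high) axis domain."""
    low, high = domain
    return [v for v in values if v < low or v > high]


def data_driven_domain(values):
    """Return the (min, max) extent, a simplified stand-in for zero=False."""
    return (min(values), max(values))


# Made-up accuracy estimates, for illustration only.
accuracies = [0.84, 0.87, 0.91]

print(outside_fixed_domain(accuracies, (0.85, 0.90)))  # → [0.84, 0.91]
print(data_driven_domain(accuracies))                  # → (0.84, 0.91)
```

With a fixed domain, any estimate that drifts outside the chosen range no longer sits inside the axis limits; a data-derived domain always covers every point.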

source/clustering.md

+7 -6
@@ -352,7 +352,7 @@ toy_example_clus1_center = alt.layer(
         x=alt.X("flipper_length_standardized"),
         y=alt.Y("bill_length_standardized")
     ),
-    alt.Chart(clus).mark_circle(color='coral', size=500, opacity=1).encode(
+    alt.Chart(clus).mark_circle(color='steelblue', size=300, opacity=1, stroke='black').encode(
         x=alt.X("mean(flipper_length_standardized)")
             .scale(zero=False, padding=20)
             .title("Flipper Length (standardized)"),
@@ -373,7 +373,7 @@ in {numref}`toy-example-clus1-center`
 :figwidth: 700px
 :name: toy-example-clus1-center
 
-Cluster 0 from the `penguins_standardized` data set example. Observations are in blue, with the cluster center highlighted in orange.
+Cluster 0 from the `penguins_standardized` data set example. Observations are small blue points, with the cluster center highlighted as a large blue point with a black outline.
 :::
 
 ```{code-cell} ipython3
@@ -417,7 +417,7 @@ These distances are denoted by lines in {numref}`toy-example-clus1-dists` for th
 :figwidth: 700px
 :name: toy-example-clus1-dists
 
-Cluster 0 from the `penguins_standardized` data set example. Observations are in blue, with the cluster center highlighted in orange. The distances from the observations to the cluster center are represented as black lines.
+Cluster 0 from the `penguins_standardized` data set example. Observations are small blue points, with the cluster center highlighted as a large blue point with a black outline. The distances from the observations to the cluster center are represented as black lines.
 :::
 
 ```{code-cell} ipython3
@@ -440,14 +440,15 @@ toy_example_all_clus_dists = alt.layer(
         alt.Y("bill_length_standardized"),
         alt.Color('cluster:N')
     ),
-    alt.Chart(penguins_clustered).mark_circle(color='coral', size=200, opacity=1).encode(
+    alt.Chart(penguins_clustered).mark_circle(size=200, opacity=1, stroke = "black").encode(
         alt.X("mean(flipper_length_standardized)")
             .scale(zero=False)
             .title("Flipper Length (standardized)"),
         alt.Y("mean(bill_length_standardized)")
             .scale(zero=False)
             .title("Bill Length (standardized)"),
-        alt.Detail('cluster:N')
+        alt.Detail('cluster:N'),
+        alt.Color('cluster:N')
     )
 )
 glue('toy-example-all-clus-dists', toy_example_all_clus_dists, display=True)
@@ -468,7 +469,7 @@ These distances are denoted by black lines in
 :figwidth: 700px
 :name: toy-example-all-clus-dists
 
-All clusters from the `penguins_standardized` data set example. Observations are in blue, orange, and red with the cluster center highlighted in orange. The distances from the observations to each of the respective cluster centers are represented as black lines.
+All clusters from the `penguins_standardized` data set example. Observations are small orange, blue, and yellow points with cluster centers denoted by larger points with a black outline. The distances from the observations to each of the respective cluster centers are represented as black lines.
 :::
 
 Since K-means uses the straight-line distance to measure the quality of a clustering,

source/foreword-text.md

100755 → 100644
+7 -6
@@ -13,13 +13,13 @@ kernelspec:
   name: python3
 ---
 
-# Foreword -- TBD
+# Foreword
 
 *Roger D. Peng*
 
 *Johns Hopkins Bloomberg School of Public Health*
 
-*2022-01-04*
+*2023-11-30*
 
 The field of data science has expanded and grown significantly in recent years,
 attracting excitement and interest from many different directions. The demand for introductory
@@ -44,9 +44,10 @@ is and what the implications are for the activities in which members of the fiel
 
 The first important concept addressed by this book is tidy data, which is a format for
 tabular data formally introduced to the statistical community in a 2014 paper by Hadley
-Wickham. The tidy data organization strategy has proven a powerful abstract concept for
-conducting data analysis, in large part because of the vast toolchain implemented in the
-Tidyverse collection of R packages. The second key concept is the development of workflows
+Wickham. Although originally popularized within the R programming language community
+via the Tidyverse package collection, the tidy data format is a language-independent concept
+that facilitates the application of powerful generalized data cleaning and wrangling tools.
+The second key concept is the development of workflows
 for reproducible and auditable data analyses. Modern data analyses have only grown in
 complexity due to the availability of data and the ease with which we can implement complex
 data analysis procedures. Furthermore, these data analyses are often part of
@@ -61,7 +62,7 @@ collaboration is a core element of data science.
 This book takes these core concepts and focuses on how one can apply them to *do* data
 science in a rigorous manner. Students who learn from this book will be well-versed in
 the techniques and principles behind producing reliable evidence from data. This book is
-centered around the use of the R programming language within the tidy data framework,
+centered around the implementation of the tidy data framework within the Python programming language,
 and as such employs the most recent advances in data analysis coding. The use of Jupyter
 notebooks for exercises immediately places the student in an environment that encourages
 auditability and reproducibility of analyses. The integration of git and GitHub into the

source/img/classification2/ML-paradigm-test.ai

+1,707
Large diffs are not rendered by default.

source/img/classification2/ML-paradigm-test.png

100755 → 100644
-246 KB

source/img/classification2/cv.ai

+3,059
Large diffs are not rendered by default.

source/img/classification2/cv.png

100755 → 100644
-5.66 KB
