Commit a182ad5

Fixes to visualizations throughout the book (#160)
* Replace ggplot terminology
* Correct explanation of maxbins
* Clarify language and use altair terminology
* Make use of parentheses consistent within the chapter
* Consistently use multiline syntax
* Explain alt.X and alt.Y and be consistent in using it only when there are no options passed
* Simplify plot by extending the range of the data instead of manipulating the axis ticks
* Clarify the differences between mark_shape and mark_circle and introduce them explicitly
* Clarify explanation of multiline titles
* Emphasize operation by putting it first
* Explain the formatting of the log plot more carefully and improve the final version by addressing the missing tick label
* Clarify which number is the Canadian population
* Increase the chart dimensions to fit all the x axis labels
* Remove legend titles and lay labels out vertically when on top
  The old behavior seems to be a replica of the R plot, which was due to how ggplot handles legends
* Clarify text around color schemes
* Show how to add a tooltip and explain the advantages that brings
* Include note about the danger of using barplots for measures of central tendency
* Clarify explanation of bar plot section
* Remove redundant titles
* Properly explain what a histogram is
* Add example on how to use maxbins
  Previously it was just silently introduced, it seems
* Restructure text and modify line appearance to be more readable
* Simplify how to facet charts and improve the explanations
* Move all the maxbins sections into one place
* Change the assignment method of a single column to the regular syntax instead of using assign and overwriting the entire df
* Update language to reflect that we are actually looking at relative error rather than relative accuracy
* Name the new column according to the naming scheme of the existing columns
  All columns in the dataframe should follow the same naming scheme
* Change first maxbins example to also use relative error
* Simplify logic of last figure and make it easier to read
* Standardize visualization syntax across chapters
  I went with what was used in most places, including the visualization chapter. This is also the style in the altair documentation. We can discuss if we want to change this, but for now it makes sense to at least have the same syntax everywhere instead of a mix, which could be confusing.
* Commit changes to saved svg chart
* Add explicit note regarding code syntax
* Explicitly mention re-using chart variable
* Extend faded prediction area to the axes limits instead of having a white border around it
* Commit changes to saved png chart
* viz ref in intro
* Fix scale to not squish horizontally
* Improve language
* Add explanation of underlines in numbers
* Improve language
* Clarify question wording
* Remove bar chart explanation
* Fix wording
* Remove title section

---------

Co-authored-by: Trevor Campbell <[email protected]>
1 parent 04407bd commit a182ad5

8 files changed, +555 -481 lines changed

source/classification1.md

Lines changed: 29 additions & 31 deletions
@@ -290,14 +290,10 @@ perimeter and concavity variables. Recall that `altair's` default palette
 is colorblind-friendly, so we can stick with that here.
 
 ```{code-cell} ipython3
-perim_concav = (
-    alt.Chart(cancer)
-    .mark_circle()
-    .encode(
-        x=alt.X("Perimeter", title="Perimeter (standardized)"),
-        y=alt.Y("Concavity", title="Concavity (standardized)"),
-        color=alt.Color("Class", title="Diagnosis"),
-    )
+perim_concav = alt.Chart(cancer).mark_circle().encode(
+    x=alt.X("Perimeter", title="Perimeter (standardized)"),
+    y=alt.Y("Concavity", title="Concavity (standardized)"),
+    color=alt.Color("Class", title="Diagnosis"),
 )
 perim_concav
 ```
@@ -1441,14 +1437,10 @@ rare_cancer = pd.concat((
     cancer[cancer["Class"] == 'Malignant'].head(3)
 ))
 
-rare_plot = (
-    alt.Chart(rare_cancer)
-    .mark_circle()
-    .encode(
-        x=alt.X("Perimeter", title="Perimeter (standardized)"),
-        y=alt.Y("Concavity", title="Concavity (standardized)"),
-        color=alt.Color("Class", title="Diagnosis"),
-    )
+rare_plot = alt.Chart(rare_cancer).mark_circle().encode(
+    x=alt.X("Perimeter", title="Perimeter (standardized)"),
+    y=alt.Y("Concavity", title="Concavity (standardized)"),
+    color=alt.Color("Class", title="Diagnosis"),
 )
 rare_plot
 ```
@@ -1555,10 +1547,10 @@ knn.fit(X=rare_cancer.loc[:, ["Perimeter", "Concavity"]], y=rare_cancer["Class"]
 
 # create a prediction pt grid
 per_grid = np.linspace(
-    rare_cancer["Perimeter"].min(), rare_cancer["Perimeter"].max(), 50
+    rare_cancer["Perimeter"].min() * 1.05, rare_cancer["Perimeter"].max() * 1.05, 50
 )
 con_grid = np.linspace(
-    rare_cancer["Concavity"].min(), rare_cancer["Concavity"].max(), 50
+    rare_cancer["Concavity"].min() * 1.05, rare_cancer["Concavity"].max() * 1.05, 50
 )
 pcgrid = np.array(np.meshgrid(per_grid, con_grid)).reshape(2, -1).T
 pcgrid = pd.DataFrame(pcgrid, columns=["Perimeter", "Concavity"])
@@ -1594,14 +1586,16 @@ prediction_plot = (
             "Perimeter",
             title="Perimeter (standardized)",
             scale=alt.Scale(
-                domain=(rare_cancer["Perimeter"].min(), rare_cancer["Perimeter"].max())
+                domain=(rare_cancer["Perimeter"].min() * 1.05, rare_cancer["Perimeter"].max() * 1.05),
+                nice=False
             ),
         ),
         y=alt.Y(
             "Concavity",
             title="Concavity (standardized)",
             scale=alt.Scale(
-                domain=(rare_cancer["Concavity"].min(), rare_cancer["Concavity"].max())
+                domain=(rare_cancer["Concavity"].min() * 1.05, rare_cancer["Concavity"].max() * 1.05),
+                nice=False
             ),
         ),
         color=alt.Color("Class", title="Diagnosis"),
@@ -1685,14 +1679,16 @@ rare_plot = (
             "Perimeter",
             title="Perimeter (standardized)",
             scale=alt.Scale(
-                domain=(rare_cancer["Perimeter"].min(), rare_cancer["Perimeter"].max())
+                domain=(rare_cancer["Perimeter"].min() * 1.05, rare_cancer["Perimeter"].max() * 1.05),
+                nice=False
             ),
         ),
         y=alt.Y(
             "Concavity",
             title="Concavity (standardized)",
             scale=alt.Scale(
-                domain=(rare_cancer["Concavity"].min(), rare_cancer["Concavity"].max())
+                domain=(rare_cancer["Concavity"].min() * 1.05, rare_cancer["Concavity"].max() * 1.05),
+                nice=False
             ),
         ),
         color=alt.Color("Class", title="Diagnosis"),
@@ -1809,10 +1805,10 @@ import numpy as np
 
 # create the grid of area/smoothness vals, and arrange in a data frame
 are_grid = np.linspace(
-    unscaled_cancer["Area"].min(), unscaled_cancer["Area"].max(), 50
+    unscaled_cancer["Area"].min() * 0.95, unscaled_cancer["Area"].max() * 1.05, 50
 )
 smo_grid = np.linspace(
-    unscaled_cancer["Smoothness"].min(), unscaled_cancer["Smoothness"].max(), 50
+    unscaled_cancer["Smoothness"].min() * 0.95, unscaled_cancer["Smoothness"].max() * 1.05, 50
 )
 asgrid = np.array(np.meshgrid(are_grid, smo_grid)).reshape(2, -1).T
 asgrid = pd.DataFrame(asgrid, columns=["Area", "Smoothness"])
@@ -1836,17 +1832,19 @@ unscaled_plot = (
             "Area",
             title="Area",
             scale=alt.Scale(
-                domain=(unscaled_cancer["Area"].min(), unscaled_cancer["Area"].max())
-            ),
+                domain=(unscaled_cancer["Area"].min() * 0.95, unscaled_cancer["Area"].max() * 1.05),
+                nice=False
+            )
         ),
         y=alt.Y(
             "Smoothness",
             title="Smoothness",
             scale=alt.Scale(
                 domain=(
-                    unscaled_cancer["Smoothness"].min(),
-                    unscaled_cancer["Smoothness"].max(),
-                )
+                    unscaled_cancer["Smoothness"].min() * 0.95,
+                    unscaled_cancer["Smoothness"].max() * 1.05,
+                ),
+                nice=False
             ),
         ),
         color=alt.Color("Class", title="Diagnosis"),
@@ -1858,8 +1856,8 @@ prediction_plot = (
     alt.Chart(prediction_table)
     .mark_point(opacity=0.05, filled=True, size=300)
     .encode(
-        x=alt.X("Area"),
-        y=alt.Y("Smoothness"),
+        x="Area",
+        y="Smoothness",
         color=alt.Color("Class", title="Diagnosis"),
     )
 )
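
Note on the padding factors in the hunks above: the standardized data is padded with `min() * 1.05`, while the unscaled data uses `min() * 0.95`. Multiplying an endpoint by a factor only widens the range when the sign of the endpoint cooperates. The following is a minimal sketch in plain Python of that arithmetic; the endpoint values are made up for illustration (they are not taken from the cancer data set), and `padded_domain` is a hypothetical alternative helper, not something this commit introduces.

```python
def scaled_domain(lo, hi, lo_factor, hi_factor):
    """Scale each endpoint by its own factor, as the diff does with min()/max()."""
    return lo * lo_factor, hi * hi_factor


def padded_domain(lo, hi, frac=0.05):
    """Hypothetical alternative: pad both ends by a fraction of the span."""
    pad = (hi - lo) * frac
    return lo - pad, hi + pad


# Made-up endpoints for illustration only.
standardized = (-2.0, 4.0)   # minimum below zero
unscaled = (100.0, 2500.0)   # minimum above zero
zero_min = (0.0, 0.4)        # minimum exactly zero

# A factor of 1.05 on a negative minimum moves it further from zero (wider).
print(scaled_domain(*standardized, 1.05, 1.05))

# The same factor on a positive minimum moves it inward and would clip data;
# a factor below 1 (0.95 in the diff) is needed for the lower end instead.
print(scaled_domain(*unscaled, 1.05, 1.05))
print(scaled_domain(*unscaled, 0.95, 1.05))

# A minimum of zero is unchanged by any factor, so only an additive offset helps.
print(scaled_domain(*zero_min, 0.95, 1.05))

# Padding by a fraction of the span widens the range whatever the signs are.
for lo, hi in (standardized, unscaled, zero_min):
    print(padded_domain(lo, hi))
```

The commit keeps the per-endpoint factors, which is fine as long as the sign of each endpoint is known in advance; a span-based helper would trade that knowledge for one extra function.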

source/classification2.md

Lines changed: 44 additions & 39 deletions
@@ -330,16 +330,11 @@ cancer['Class'] = cancer['Class'].replace({
 # create scatter plot of tumor cell concavity versus smoothness,
 # labeling the points be diagnosis class
 
-perim_concav = (
-    alt.Chart(cancer)
-    .mark_circle()
-    .encode(
-        x="Smoothness",
-        y="Concavity",
-        color=alt.Color("Class", title="Diagnosis"),
-    )
+perim_concav = alt.Chart(cancer).mark_circle().encode(
+    x=alt.X("Smoothness", scale=alt.Scale(zero=False)),
+    y="Concavity",
+    color=alt.Color("Class", title="Diagnosis"),
 )
-
 perim_concav
 ```
 
@@ -1081,19 +1076,15 @@ as shown in {numref}`fig:06-find-k`.
 ```{code-cell} ipython3
 :tags: [remove-output]
 
-accuracy_vs_k = (
-    alt.Chart(accuracies_grid)
-    .mark_line(point=True)
-    .encode(
-        x=alt.X(
-            "n_neighbors",
-            title="Neighbors",
-        ),
-        y=alt.Y(
-            "mean_test_score",
-            title="Accuracy estimate",
-            scale=alt.Scale(domain=(0.85, 0.90)),
-        ),
+accuracy_vs_k = alt.Chart(accuracies_grid).mark_line(point=True).encode(
+    x=alt.X(
+        "n_neighbors",
+        title="Neighbors",
+    ),
+    y=alt.Y(
+        "mean_test_score",
+        title="Accuracy estimate",
+        scale=alt.Scale(domain=(0.85, 0.90)),
     )
 )
 
@@ -1170,19 +1161,15 @@ large_accuracies_grid = pd.DataFrame(
     ).cv_results_
 )
 
-large_accuracy_vs_k = (
-    alt.Chart(large_accuracies_grid)
-    .mark_line(point=True)
-    .encode(
-        x=alt.X(
-            "param_kneighborsclassifier__n_neighbors",
-            title="Neighbors",
-        ),
-        y=alt.Y(
-            "mean_test_score",
-            title="Accuracy estimate",
-            scale=alt.Scale(domain=(0.60, 0.90)),
-        ),
+large_accuracy_vs_k = alt.Chart(large_accuracies_grid).mark_line(point=True).encode(
+    x=alt.X(
+        "param_kneighborsclassifier__n_neighbors",
+        title="Neighbors",
+    ),
+    y=alt.Y(
+        "mean_test_score",
+        title="Accuracy estimate",
+        scale=alt.Scale(domain=(0.60, 0.90)),
     )
 )
 
@@ -1269,10 +1256,10 @@ y = cancer_train["Class"]
 
 # create a prediction pt grid
 smo_grid = np.linspace(
-    cancer_train["Smoothness"].min(), cancer_train["Smoothness"].max(), 100
+    cancer_train["Smoothness"].min() * 0.95, cancer_train["Smoothness"].max() * 1.05, 100
 )
 con_grid = np.linspace(
-    cancer_train["Concavity"].min(), cancer_train["Concavity"].max(), 100
+    cancer_train["Concavity"].min() - 0.025, cancer_train["Concavity"].max() * 1.05, 100
 )
 scgrid = np.array(np.meshgrid(smo_grid, con_grid)).reshape(2, -1).T
 scgrid = pd.DataFrame(scgrid, columns=["Smoothness", "Concavity"])
@@ -1294,8 +1281,26 @@ for k in [1, 7, 20, 300]:
         )
         .mark_point(opacity=0.2, filled=True, size=20)
         .encode(
-            x=alt.X("Smoothness"),
-            y=alt.Y("Concavity"),
+            x=alt.X(
+                "Smoothness",
+                scale=alt.Scale(
+                    domain=(
+                        cancer_train["Smoothness"].min() * 0.95,
+                        cancer_train["Smoothness"].max() * 1.05
+                    ),
+                    nice=False
+                )
+            ),
+            y=alt.Y(
+                "Concavity",
+                scale=alt.Scale(
+                    domain=(
+                        cancer_train["Concavity"].min() -0.025,
+                        cancer_train["Concavity"].max() * 1.05
+                    ),
+                    nice=False
+                )
+            ),
             color=alt.Color("Class", title="Diagnosis"),
         )
     )

source/img/faithful_plot.png

9.66 KB

source/img/faithful_plot.svg

Lines changed: 1 addition & 1 deletion

source/intro.md

Lines changed: 26 additions & 38 deletions
@@ -804,13 +804,12 @@ import altair as alt
 
 +++
 
-The fundamental object in `altair` is the `Chart`, which takes a data frame as a single argument: `alt.Chart(ten_lang)`.
+The fundamental object in `altair` is the `Chart`, which takes a data frame as an argument: `alt.Chart(ten_lang)`.
 With a chart object in hand, we can now specify how we would like the data to be visualized.
-We first indicate what kind of geometric mark we want to use to represent the data. Here we set the mark attribute
+We first indicate what kind of graphical *mark* we want to use to represent the data. Here we set the mark attribute
 of the chart object using the `Chart.mark_bar` function, because we want to create a bar chart.
-Next, we need to encode the variables of the data frame using
-the `x` (represents the x-axis position of the points) and
-`y` (represents the y-axis position of the points) *channels*. We use the `encode()`
+Next, we need to *encode* the variables of the data frame using
+the `x` and `y` *channels* (which represent the x-axis and y-axis position of the points). We use the `encode()`
 function to handle this: we specify that the `language` column should correspond to the x-axis,
 and that the `mother_tongue` column should correspond to the y-axis.
 
@@ -853,7 +852,7 @@ Bar plot of the ten Aboriginal languages most often reported by Canadian residen
 ```{index} see: .; chaining methods
 ```
 
-### Formatting `altair` objects
+### Formatting `altair` charts
 
 It is exciting that we can already visualize our data to help answer our
 question, but we are not done yet! We can (and should) do more to improve the
@@ -865,28 +864,27 @@ example above, Python uses the column name `mother_tongue` as the label for the
 y axis, but most people will not know what that is. And even if they did, they
 will not know how we measured this variable, or the group of people on which the
 measurements were taken. An axis label that reads "Mother Tongue (Number of
-Canadian Residents)" would be much more informative.
+Canadian Residents)" would be much more informative. To make the code easier to
+read, we're spreading it out over multiple lines just as we did in the previous
+section with pandas.
 
 ```{index} plot; labels, plot; axis labels
 ```
 
 Adding additional labels to our visualizations that we create in `altair` is
 one common and easy way to improve and refine our data visualizations. We can add titles for the axes
 in the `altair` objects using `alt.X` and `alt.Y` with the `title` argument to make
-the axes titles more informative.
+the axes titles more informative (you will learn more about `alt.X` and `alt.Y` in the {ref}`viz` chapter).
 Again, since we are specifying
 words (e.g. `"Mother Tongue (Number of Canadian Residents)"`) as arguments to
 `alt.X` and `alt.Y`, we surround them with double quotation marks. We can do many other modifications
 to format the plot further, and we will explore these in the {ref}`viz` chapter.
 
 ```{code-cell} ipython3
-barplot_mother_tongue = (
-    alt.Chart(ten_lang)
-    .mark_bar().encode(
-        x=alt.X('language', title='Language'),
-        y=alt.Y('mother_tongue', title='Mother Tongue (Number of Canadian Residents)')
-    ))
-
+barplot_mother_tongue = alt.Chart(ten_lang).mark_bar().encode(
+    x=alt.X('language', title='Language'),
+    y=alt.Y('mother_tongue', title='Mother Tongue (Number of Canadian Residents)')
+)
 ```
 
 
@@ -915,13 +913,10 @@ To accomplish this, we will swap the x and y coordinate axes:
 
 
 ```{code-cell} ipython3
-barplot_mother_tongue_axis = (
-    alt.Chart(ten_lang)
-    .mark_bar().encode(
-        x=alt.X('mother_tongue', title='Mother Tongue (Number of Canadian Residents)'),
-        y=alt.Y('language', title='Language')
-    ))
-
+barplot_mother_tongue_axis = alt.Chart(ten_lang).mark_bar().encode(
+    x=alt.X('mother_tongue', title='Mother Tongue (Number of Canadian Residents)'),
+    y=alt.Y('language', title='Language')
+)
 ```
 
 ```{code-cell} ipython3
@@ -951,13 +946,10 @@ the `sort` argument, which orders a variable (here `language`) based on the
 values of the variable(`mother_tongue`) on the `x-axis`.
 
 ```{code-cell} ipython3
-ordered_barplot_mother_tongue = (
-    alt.Chart(ten_lang)
-    .mark_bar().encode(
-        x=alt.X('mother_tongue', title='Mother Tongue (Number of Canadian Residents)'),
-        y=alt.Y('language', sort='x', title='Language')
-    ))
-
+ordered_barplot_mother_tongue = alt.Chart(ten_lang).mark_bar().encode(
+    x=alt.X('mother_tongue', title='Mother Tongue (Number of Canadian Residents)'),
+    y=alt.Y('language', sort='x', title='Language')
+)
 ```
 
 +++
@@ -1028,17 +1020,13 @@ ten_lang = (
     can_lang.loc[can_lang["category"] == "Aboriginal languages", ["language", "mother_tongue"]]
     .sort_values(by="mother_tongue", ascending=False)
     .head(10)
-)
+)
 
 # create the visualization
-ten_lang_plot = (
-    alt.Chart(ten_lang)
-    .mark_bar().encode(
-        x=alt.X('mother_tongue', title='Mother Tongue (Number of Canadian Residents)'),
-        y=alt.Y('language', sort='x', title='Language')
-    ))
-
-
+ten_lang_plot = alt.Chart(ten_lang).mark_bar().encode(
+    x=alt.X('mother_tongue', title='Mother Tongue (Number of Canadian Residents)'),
+    y=alt.Y('language', sort='x', title='Language')
+)
 ```
 
 ```{code-cell} ipython3
