From f893478091a10a4818c62c64a90feb961a8e86b5 Mon Sep 17 00:00:00 2001
From: Trevor Campbell
Date: Fri, 10 Nov 2023 21:48:22 -0800
Subject: [PATCH 01/11] added better barplot discussion

---
 source/viz.md | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/source/viz.md b/source/viz.md
index 8905556c..a2e05237 100755
--- a/source/viz.md
+++ b/source/viz.md
@@ -1168,6 +1168,18 @@ In a bar plot, the height of the bar represents the value of a summary statistic
 They are particularly useful for comparing summary statistics between different
 groups of a categorical variable.
 
+
+Here, we have a data frame of Earth's landmasses,
+and are trying to compare their sizes.
+The right type of visualization to answer this question is a bar plot.
+In a bar plot, the height of each bar represents the value of an *amount*
+(a size, count, proportion, percentage, etc).
+They are particularly useful for comparing counts or proportions across different
+groups of a categorical variable. Note, however, that bar plots should generally not be
+used to display mean or median values, as they hide important information about
+the variation of the data. Instead it's better to show the distribution of
+all the individual data points, e.g., using a histogram, which we will discuss further in {numref}`histogramsviz`.
+
 ```{index} altair; mark_bar
 ```
 
@@ -1292,6 +1304,7 @@ visualization for answering our original questions.
 Landmasses are organized by their size,
 and continents are colored differently than other landmasses,
 making it quite clear that all the seven largest landmasses are continents.
+(histogramsviz)=
 ### Histograms: the Michelson speed of light data set
 
 ```{index} Michelson speed of light

From e871cfe322dde73c9cb50a9b95ed873ac0c85484 Mon Sep 17 00:00:00 2001
From: Trevor Campbell
Date: Fri, 10 Nov 2023 21:58:54 -0800
Subject: [PATCH 02/11] fix spacing on parentheses

---
 source/viz.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/source/viz.md b/source/viz.md
index a2e05237..12bb8de2 100755
--- a/source/viz.md
+++ b/source/viz.md
@@ -1047,7 +1047,7 @@ can_lang_plot_theme = alt.Chart(can_lang).mark_point(filled=True).encode(
     y=alt.Y("mother_tongue_percent")
         .scale(type="log")
         .axis(tickCount=7)
-        .title("Mother tongue(percentage of Canadian residents)"),
+        .title("Mother tongue (percentage of Canadian residents)"),
     color=alt.Color("category")
         .legend(orient="top")
         .title("")
@@ -1089,7 +1089,7 @@ can_lang_plot_tooltip = alt.Chart(can_lang).mark_point(filled=True).encode(
     y=alt.Y("mother_tongue_percent")
         .scale(type="log")
         .axis(tickCount=7)
-        .title("Mother tongue(percentage of Canadian residents)"),
+        .title("Mother tongue (percentage of Canadian residents)"),
     color=alt.Color("category")
         .legend(orient="top")
         .title("")

From 33dbe9818e225e5bbf8353dc09a9073c895cde9e Mon Sep 17 00:00:00 2001
From: Trevor Campbell
Date: Fri, 10 Nov 2023 22:03:14 -0800
Subject: [PATCH 03/11] landmass plot caption improvement/typo fix

---
 source/viz.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/source/viz.md b/source/viz.md
index 12bb8de2..f444515d 100755
--- a/source/viz.md
+++ b/source/viz.md
@@ -1295,7 +1295,7 @@ glue("islands_plot_sorted", islands_plot_sorted, display=True)
 :figwidth: 700px
 :name: islands_plot_sorted
 
-Bar plot of size for Earth's largest 12 landmasses colored by whether its a continent with clearer axes and labels.
+Bar plot of size for Earth's largest 12 landmasses, colored by landmass type, with clearer axes and labels.
 :::
 
 

From ab817d5f65162c8431d5277d27d6a2dd3cd6a2f8 Mon Sep 17 00:00:00 2001
From: Trevor Campbell
Date: Sat, 11 Nov 2023 13:57:28 -0800
Subject: [PATCH 04/11] remove old barplot text

---
 source/viz.md | 9 ---------
 1 file changed, 9 deletions(-)

diff --git a/source/viz.md b/source/viz.md
index f444515d..ff93e4c2 100755
--- a/source/viz.md
+++ b/source/viz.md
@@ -1160,15 +1160,6 @@ islands_df = pd.read_csv("data/islands.csv")
 islands_df
 ```
 
-Here, we have a data frame of Earth's landmasses,
-and are trying to compare their sizes.
-The right type of visualization to answer this question is a bar plot.
-In a bar plot, the height of the bar represents the value of a summary statistic
-(usually a size, count, sum, proportion, or percentage).
-They are particularly useful for comparing summary statistics between different
-groups of a categorical variable.
-
-
 Here, we have a data frame of Earth's landmasses,
 and are trying to compare their sizes.
 The right type of visualization to answer this question is a bar plot.

From cf552608e01df7133f785b9ebf833da4c0267739 Mon Sep 17 00:00:00 2001
From: Trevor Campbell
Date: Sat, 11 Nov 2023 14:08:15 -0800
Subject: [PATCH 05/11] fix captions

---
 source/viz.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/source/viz.md b/source/viz.md
index ff93e4c2..3e665786 100755
--- a/source/viz.md
+++ b/source/viz.md
@@ -1066,7 +1066,7 @@ glue("can_lang_plot_theme", can_lang_plot_theme.properties(height=320, width=420
 :figwidth: 700px
 :name: can_lang_plot_theme
 
-Scatter plot of percentage of Canadians reporting a language as their mother tongue vs the primary language at home colored by language category with custom colors.
+Scatter plot of percentage of Canadians reporting a language as their mother tongue vs the primary language at home colored by language category with custom colors and shapes.
 :::
 
 The chart above gives a good indication of how the different language categories differ,
@@ -1194,7 +1194,7 @@ glue("islands_bar", islands_bar, display=False)
 :figwidth: 400px
 :name: islands_bar
 
-Bar plot of all Earth's landmasses' size with squished labels.
+Bar plot of Earth's landmass sizes. The plot is too wide with the default settings.
 :::
 
 Alright, not bad! The plot in {numref}`islands_bar` is

From 1a4cac7917fccf8ca8e095a229f066583cbe883a Mon Sep 17 00:00:00 2001
From: Trevor Campbell
Date: Sat, 11 Nov 2023 14:36:48 -0800
Subject: [PATCH 06/11] improved barplot text in viz; title setting

---
 source/viz.md | 43 ++++++++++++++++++++++++++++++-------------
 1 file changed, 30 insertions(+), 13 deletions(-)

diff --git a/source/viz.md b/source/viz.md
index 3e665786..5f91d824 100755
--- a/source/viz.md
+++ b/source/viz.md
@@ -1247,12 +1247,10 @@ and allows us to answer our initial questions:
 "Are the seven continents Earth's largest landmasses?"
 and "Which are the next few largest landmasses?".
 However, we could still improve this visualization
-by organizing the bars by landmass size rather than by alphabetical order
-and by coloring the bars based on whether they correspond to a continent.
-The data for this is stored in the `landmass_type` column.
-To use this to color the bars,
+by coloring the bars based on whether they correspond to a continent, and
+by organizing the bars by landmass size rather than by alphabetical order.
+The data for coloring the bars is stored in the `landmass_type` column, so
 we set the `color` encoding to `landmass_type`.
-
 To organize the landmasses by their `size` variable,
 we will use the altair `sort` function
 in the y-encoding of the chart.
@@ -1262,18 +1260,37 @@ This plots the values on `y` axis
 in the ascending order of `x` axis values.
 This creates a chart where the largest bar is the closest to the axis line,
 which is generally the most visually appealing when sorting bars.
-If instead
-we want to sort the values on `y-axis` in descending order of `x-axis`,
-we can add a minus sign to reverse the order and specify `sort="-x"`.
+If instead we wanted to sort the values on `y-axis` in descending order of `x-axis`,
+we could add a minus sign to reverse the order and specify `sort="-x"`.
 
 ```{index} altair; sort
 ```
 
-```{code-cell} ipython3
-islands_plot_sorted = alt.Chart(islands_top12).mark_bar().encode(
-    x="size",
-    y=alt.Y("landmass").sort("x"),
-    color=alt.Color("landmass_type")
+To finalize this plot we will customize the axis and legend labels using the `title` method,
+and add a title to the chart by specifying the `title` argument of `alt.Chart`.
+Plot titles are not always required, especially when it would be redundant with an already-existing
+caption or surrounding context (e.g., in a slide presentation with annotations).
+But if you decide to include one, a good plot title should provide the take home message
+that you want readers to focus on, e.g., "The Earth's seven largest landmasses are all continents,"
+but it could also more general, e.g., "The twelve largest landmasses on Earth."
+
+Note that
+For categorical encodings,
+such as the color and y channels in our chart,
+it is often not necessary to include the axis title
+as the labels of the categories are enough by themselves.
+Particularly in this case where the title clearly states
+that we are landmasses,
+the titles are redundant and we can remove them.
+
+```{code-cell} ipython3
+islands_plot_sorted = alt.Chart(
+    islands_top12,
+    title="The Earth's seven largest landmasses are all continents"
+).mark_bar().encode(
+    x=alt.X("size").title("Size (1000 square mi)"),
+    y=alt.Y("landmass").sort("x").title("Landmass"),
+    color=alt.Color("landmass_type").title("Type")
 )
 ```
 

From 3f7b61ef02c9902452e1d2735f8536b54238c2c9 Mon Sep 17 00:00:00 2001
From: Trevor Campbell
Date: Sat, 11 Nov 2023 14:46:30 -0800
Subject: [PATCH 07/11] polish on barplot title

---
 source/viz.md | 13 ++-----------
 1 file changed, 2 insertions(+), 11 deletions(-)

diff --git a/source/viz.md b/source/viz.md
index 5f91d824..38842ee2 100755
--- a/source/viz.md
+++ b/source/viz.md
@@ -1269,19 +1269,10 @@ we could add a minus sign to reverse the order and specify `sort="-x"`.
 To finalize this plot we will customize the axis and legend labels using the `title` method,
 and add a title to the chart by specifying the `title` argument of `alt.Chart`.
 Plot titles are not always required, especially when it would be redundant with an already-existing
-caption or surrounding context (e.g., in a slide presentation with annotations).
+caption or surrounding context (e.g., in a slide presentation with annotations).
 But if you decide to include one, a good plot title should provide the take home message
 that you want readers to focus on, e.g., "The Earth's seven largest landmasses are all continents,"
-but it could also more general, e.g., "The twelve largest landmasses on Earth."
-
-Note that
-For categorical encodings,
-such as the color and y channels in our chart,
-it is often not necessary to include the axis title
-as the labels of the categories are enough by themselves.
-Particularly in this case where the title clearly states
-that we are landmasses,
-the titles are redundant and we can remove them.
+or a more general summary of the information displayed, e.g., "The twelve largest landmasses on Earth."
 
 ```{code-cell} ipython3
 islands_plot_sorted = alt.Chart(

From a2cb168e1f2c786f6088b77f8a33293d4cbd8203 Mon Sep 17 00:00:00 2001
From: Trevor Campbell
Date: Sat, 11 Nov 2023 15:25:04 -0800
Subject: [PATCH 08/11] shortened barplot title

---
 source/viz.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/source/viz.md b/source/viz.md
index 38842ee2..3db55dc6 100755
--- a/source/viz.md
+++ b/source/viz.md
@@ -1271,13 +1271,13 @@ and add a title to the chart by specifying the `title` argument of `alt.Chart`.
 Plot titles are not always required, especially when it would be redundant with an already-existing
 caption or surrounding context (e.g., in a slide presentation with annotations).
 But if you decide to include one, a good plot title should provide the take home message
-that you want readers to focus on, e.g., "The Earth's seven largest landmasses are all continents,"
+that you want readers to focus on, e.g., "Earth's seven largest landmasses are all continents,"
 or a more general summary of the information displayed, e.g., "The twelve largest landmasses on Earth."
 
 ```{code-cell} ipython3
 islands_plot_sorted = alt.Chart(
     islands_top12,
-    title="The Earth's seven largest landmasses are all continents"
+    title="Earth's seven largest landmasses are all continents"
 ).mark_bar().encode(
     x=alt.X("size").title("Size (1000 square mi)"),
     y=alt.Y("landmass").sort("x").title("Landmass"),

From b5c35cb09a38c5b6cf3d12a5a8a2d68ec52b4f0a Mon Sep 17 00:00:00 2001
From: Trevor Campbell
Date: Sat, 11 Nov 2023 16:45:03 -0800
Subject: [PATCH 09/11] consistency with R (which needs shorter titles to keep consistent font sizes...)

---
 source/viz.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/source/viz.md b/source/viz.md
index 3db55dc6..21784b28 100755
--- a/source/viz.md
+++ b/source/viz.md
@@ -1271,13 +1271,13 @@ and add a title to the chart by specifying the `title` argument of `alt.Chart`.
 Plot titles are not always required, especially when it would be redundant with an already-existing
 caption or surrounding context (e.g., in a slide presentation with annotations).
 But if you decide to include one, a good plot title should provide the take home message
-that you want readers to focus on, e.g., "Earth's seven largest landmasses are all continents,"
+that you want readers to focus on, e.g., "Earth's seven largest landmasses are continents,"
 or a more general summary of the information displayed, e.g., "The twelve largest landmasses on Earth."
 
 ```{code-cell} ipython3
 islands_plot_sorted = alt.Chart(
     islands_top12,
-    title="Earth's seven largest landmasses are all continents"
+    title="Earth's seven largest landmasses are continents"
 ).mark_bar().encode(
     x=alt.X("size").title("Size (1000 square mi)"),
     y=alt.Y("landmass").sort("x").title("Landmass"),

From 02406a2f06fe3bbd34e92c9d01384e988f2c33d2 Mon Sep 17 00:00:00 2001
From: Trevor Campbell
Date: Sat, 11 Nov 2023 16:58:11 -0800
Subject: [PATCH 10/11] remove end of line spaces

---
 source/viz.md | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/source/viz.md b/source/viz.md
index 21784b28..caf2b458 100755
--- a/source/viz.md
+++ b/source/viz.md
@@ -1112,7 +1112,7 @@ else:
 :figwidth: 700px
 :name: can_lang_plot_tooltip
 
-Scatter plot of percentage of Canadians reporting a language as their mother tongue vs the primary language at home colored by language category with custom colors and mouse hover tooltip.
+Scatter plot of percentage of Canadians reporting a language as their mother tongue vs the primary language at home colored by language category with custom colors and mouse hover tooltip.
 :::
 
 From the visualization in {numref}`can_lang_plot_tooltip`,
@@ -1160,15 +1160,15 @@ islands_df = pd.read_csv("data/islands.csv")
 islands_df
 ```
 
-Here, we have a data frame of Earth's landmasses,
-and are trying to compare their sizes.
-The right type of visualization to answer this question is a bar plot.
+Here, we have a data frame of Earth's landmasses,
+and are trying to compare their sizes.
+The right type of visualization to answer this question is a bar plot.
 In a bar plot, the height of each bar represents the value of an *amount*
 (a size, count, proportion, percentage, etc).
 They are particularly useful for comparing counts or proportions across different
-groups of a categorical variable. Note, however, that bar plots should generally not be
+groups of a categorical variable. Note, however, that bar plots should generally not be
 used to display mean or median values, as they hide important information about
-the variation of the data. Instead it's better to show the distribution of
+the variation of the data. Instead it's better to show the distribution of
 all the individual data points, e.g., using a histogram, which we will discuss further in {numref}`histogramsviz`.
 
 ```{index} altair; mark_bar
@@ -1212,7 +1212,7 @@ so that the labels are on the y-axis and we don't have to tilt our head to read
 ```{note}
 Recall that in {numref}`Chapter %s `, we used `sort_values` followed by `head` to obtain
 the ten rows with the largest values of a variable. We could have instead used the `nlargest` function
-from `pandas` for this purpose. The `nsmallest` and `nlargest` functions achieve the same goal
+from `pandas` for this purpose. The `nsmallest` and `nlargest` functions achieve the same goal
 as `sort_values` followed by `head`, but are slightly more efficient because they are specialized for this purpose.
 In general, it is good to use more specialized functions when they are available!
 ```
@@ -1360,7 +1360,7 @@ Note that this time,
 we are setting the `y` encoding to `"count()"`.
 There is no `"count()"` column-name in `morley_df`;
 we use `"count()"` to tell `altair`
-that we want to count the number of occurrences of each value in along the x-axis
+that we want to count the number of occurrences of each value in along the x-axis
 (which we encoded as the `Speed` column).
 
 ```{code-cell} ipython3

From d14311acb2f375277f363251c8ad2bc99a619778 Mon Sep 17 00:00:00 2001
From: Trevor Campbell
Date: Sat, 11 Nov 2023 17:27:25 -0800
Subject: [PATCH 11/11] minor adjustment to align with R

---
 source/viz.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/source/viz.md b/source/viz.md
index caf2b458..37997a1f 100755
--- a/source/viz.md
+++ b/source/viz.md
@@ -1272,7 +1272,7 @@ Plot titles are not always required, especially when it would be redundant with
 caption or surrounding context (e.g., in a slide presentation with annotations).
 But if you decide to include one, a good plot title should provide the take home message
 that you want readers to focus on, e.g., "Earth's seven largest landmasses are continents,"
-or a more general summary of the information displayed, e.g., "The twelve largest landmasses on Earth."
+or a more general summary of the information displayed, e.g., "Earth's twelve largest landmasses."
 
 ```{code-cell} ipython3
 islands_plot_sorted = alt.Chart(