Commit f87f0c0

Ch1 fig cleanup (#99)
* first figures in ch1:
* code figures for ch1, including ppt to edit them
* update figure sizes
* remove old lingering image
* removed hidden pptx cache file

Co-authored-by: Trevor Campbell <[email protected]>
1 parent f755a9a commit f87f0c0

8 files changed
+47 -41 lines changed

source/img/altair_syntax.png

180 KB

source/img/code-figures.pptx

194 KB
Binary file not shown.

source/img/filter_rows.png

152 KB
155 KB

source/img/read_csv_function.png

112 KB

source/img/select_columns.png

91.7 KB

source/img/sort_values.png

146 KB

source/intro.md

+47 -41
@@ -224,7 +224,7 @@ affiliations, cities' populations, etc.
 
 ```{figure} img/spreadsheet_vs_df.png
 ---
-height: 400px
+height: 500px
 name: img-spreadsheet-vs-data frame
 ---
 A spreadsheet versus a data frame in Python
@@ -311,11 +311,9 @@ file satisfies everything else that the `read_csv` function expects in the defau
 use-case. {numref}`img-read-csv` describes how we use the `read_csv`
 to read data into Python.
 
-**(FIGURE 1.2 FROM R BOOK IS NOT MISSING, BUT STILL R VERSION. NEEDS PD.READ_CSV)**
-
-```{figure} img/read_csv_function.jpeg
+```{figure} img/read_csv_function.png
 ---
-height: 200px
+height: 220px
 name: img-read-csv
 ---
 Syntax for the `read_csv` function
@@ -324,6 +322,7 @@ Syntax for the `read_csv` function
 
 +++
 ```{code-cell} ipython3
+:tags: ["output_scroll"]
 pd.read_csv("data/can_lang.csv")
 
 ```
@@ -426,6 +425,7 @@ variables (i.e., columns) are printed just underneath the data frame (214 rows a
 Printing a few rows from a data frame like this is a handy way to get a quick sense for what is contained in it.
 
 ```{code-cell} ipython3
+:tags: ["output_scroll"]
 can_lang
 ```
 
@@ -486,11 +486,9 @@ or one of the names we have given to objects in the code we have already written
 > of `"Aboriginal languages"` above, or `'category'` instead of `"category"`.
 > Try both out for yourself!
 
-**(This figure is wrong-- should be for [] operation below)**
-
-```{figure} img/read_csv_function.jpeg
+```{figure} img/filter_rows.png
 ---
-height: 200px
+height: 220px
 name: img-filter
 ---
 Syntax for using the `[]` operation to filter rows.
@@ -500,6 +498,7 @@ This operation returns a data frame that has all the columns of the input data f
 but only those rows corresponding to Aboriginal languages that we asked for in the logical statement.
 
 ```{code-cell} ipython3
+:tags: ["output_scroll"]
 can_lang[can_lang["category"] == "Aboriginal languages"]
 ```
 
@@ -519,11 +518,9 @@ selecting only the `language` and `mother_tongue` columns from our original
 `can_lang` data frame, we put the list `["language", "mother_tongue"]`
 containing those two column names inside the square brackets of the `[]` operation.
 
-**(This figure is wrong-- should be for [] operation below)**
-
-```{figure} img/read_csv_function.jpeg
+```{figure} img/select_columns.png
 ---
-height: 200px
+height: 220px
 name: img-select
 ---
 Syntax for using the `[]` operation to select columns.
@@ -553,18 +550,18 @@ that with the `.loc[]` method. Inside the square brackets,
 we write our row filtering logical statement,
 then a comma, then our list of columns to select.
 
-**(This figure is wrong-- should be for .loc[] operation below)**
-
-```{figure} img/read_csv_function.jpeg
+```{figure} img/filter_rows_and_columns.png
 ---
-height: 200px
+height: 220px
 name: img-loc
 ---
 Syntax for using the `loc[]` operation to filter rows and select columns.
 ```
 
 ```{code-cell} ipython3
-aboriginal_lang = can_lang.loc[can_lang["category"] == "Aboriginal languages", ["language", "mother_tongue"]]
+aboriginal_lang = can_lang.loc[
+    can_lang["category"] == "Aboriginal languages", ["language", "mother_tongue"]
+]
 ```
 There is one very important thing to notice in this code example.
 The first is that we used the `loc[]` operation on the `can_lang` data frame by
@@ -610,7 +607,13 @@ language, we will use the `sort_values` function to order the rows in our
 arrange the rows in descending order (from largest to smallest),
 so we specify the argument `ascending` as `False`.
 
-**(FIGURE 1.5 FROM R BOOK MISSING HERE)**
+```{figure} img/sort_values.png
+---
+height: 220px
+name: img-sort-values
+---
+Syntax for using `sort_values` to arrange rows in descending order.
+```
 
 ```{code-cell} ipython3
 arranged_lang = aboriginal_lang.sort_values(by='mother_tongue', ascending=False)
@@ -636,8 +639,8 @@ ten_lang
 It took us 3 steps to find the ten Aboriginal languages most often reported in
 2016 as mother tongues in Canada. Starting from the `can_lang` data frame, we:
 
-1) used `loc` to filter the rows so that only the
-`Aboriginal languages` category remained, and selected the
+1) used `loc` to filter the rows so that only the
+`Aboriginal languages` category remained, and selected the
 `language` and `mother_tongue` columns,
 2) used `sort_values` to sort the rows by `mother_tongue` in descending order, and
 3) obtained only the top 10 values using `head`.
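
As a rough illustration of the three steps summarized above, the following sketch runs them on a tiny made-up data frame. The name `toy_lang` and all of its rows and counts are hypothetical stand-ins, not values from `can_lang.csv`; only the column names `category`, `language`, and `mother_tongue` come from the text.

```python
import pandas as pd

# Hypothetical stand-in for can_lang; the values are invented for illustration.
toy_lang = pd.DataFrame(
    {
        "category": [
            "Aboriginal languages",
            "Official languages",
            "Aboriginal languages",
            "Aboriginal languages",
        ],
        "language": ["Lang A", "Lang B", "Lang C", "Lang D"],
        "mother_tongue": [50, 9000, 300, 120],
    }
)

# 1) filter the rows and select the columns with loc
aboriginal_lang = toy_lang.loc[
    toy_lang["category"] == "Aboriginal languages", ["language", "mother_tongue"]
]
# 2) sort by mother_tongue, largest first
arranged_lang = aboriginal_lang.sort_values(by="mother_tongue", ascending=False)
# 3) keep the top rows (2 here instead of 10, since the toy table is small)
top_lang = arranged_lang.head(2)

print(top_lang)
```

Each step assigns a named intermediate object, which is the pattern the later hunks of this diff go on to discuss.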
@@ -659,30 +662,30 @@ It is hard to keep track of what methods are being called, and what arguments we
 Second, each line introduces a new temporary object. In this case, both `aboriginal_lang` and `arranged_lang_sorted`
 are just temporary results on the way to producing the `ten_lang` data frame.
 This makes the code hard to read, as one has to trace where each temporary object
-goes, and hard to understand, since introducing many named objects also suggests that they
+goes, and hard to understand, since introducing many named objects also suggests that they
 are of some importance, when really they are just intermediates.
 The need to call multiple methods in a sequence to process a data frame is
 quite common, so this is an important issue to address!
 
 To solve the first problem, we can actually split the long expressions above across
 multiple lines. Although in most cases, a single expression in Python must be contained
-in a single line of code, there are a small number of situations where Python lets us do this.
+in a single line of code, there are a small number of situations where Python lets us do this.
 Let's rewrite this code in a more readable format using multiline expressions.
 
 ```{code-cell} ipython3
 aboriginal_lang = can_lang.loc[
-    can_lang["category"] == "Aboriginal languages",
-    ["language", "mother_tongue"]]
+    can_lang["category"] == "Aboriginal languages", ["language", "mother_tongue"]
+]
 arranged_lang_sorted = aboriginal_lang.sort_values(
-    by='mother_tongue',
-    ascending=False)
+    by='mother_tongue', ascending=False
+)
 ten_lang = arranged_lang_sorted.head(10)
 ```
 
 This code is the same as the code we showed earlier; you can see the same
 sequence of methods and arguments is used. But long expressions are split
 across multiple lines when they would otherwise get long and unwieldy,
-improving the readability of the code.
+improving the readability of the code.
 How does Python know when to keep
 reading on the next line for a single expression?
 For the line starting with `aboriginal_lang = ...`, Python sees that the line ends with a left
@@ -692,7 +695,7 @@ We put the same two arguments as we did before, and then
 the corresponding right bracket appears after `["language", "mother_tongue"]`).
 For the line starting with `arranged_lang_sorted = ...`, Python sees that the line ends with a left parenthesis symbol `(`,
 and knows the expression cannot end until we close it with the corresponding right parenthesis symbol `)`.
-Again we use the same two arguments as before, and then the
+Again we use the same two arguments as before, and then the
 corresponding right parenthesis appears right after `ascending=False`.
 In both cases, Python keeps reading the next line to figure out
 what the rest of the expression is. We could, of course,
@@ -701,7 +704,7 @@ multiple lines helps a lot with code readability.
 
 We still have to handle the issue that each line of code---i.e., each step in the analysis---introduces
 a new temporary object. To address this issue, we can *chain* multiple operations together without
-assigning intermediate objects. The key idea of chaining is that the *output* of
+assigning intermediate objects. The key idea of chaining is that the *output* of
 each step in the analysis is a data frame, which means that you can just directly keep calling methods
 that operate on the output of each step in a sequence! This simplifies the code and makes it
 easier to read. The code below demonstrates the use of both multiline expressions and chaining together.
@@ -712,7 +715,7 @@ from the messy code above!
 # obtain the 10 most common Aboriginal languages
 ten_lang = (
     can_lang.loc[
-        can_lang["category"] == "Aboriginal languages",
+        can_lang["category"] == "Aboriginal languages",
         ["language", "mother_tongue"]
     ]
     .sort_values(by="mother_tongue", ascending=False)
@@ -721,15 +724,15 @@ ten_lang = (
 ten_lang
 ```
 
-Let's parse this new block of code piece by piece.
+Let's parse this new block of code piece by piece.
 The code above starts with a left parenthesis, `(`, and so Python
 knows to keep reading to subsequent lines until it finds the corresponding
 right parenthesis symbol `)`. The `loc` method performs the filtering and selecting steps as before. The line after this
-starts with a period (`.`) that "chains" the output of the `loc` step with the next operation,
-`sort_values`. Since the output of `loc` is a data frame, we can use the `sort_values` method on it
+starts with a period (`.`) that "chains" the output of the `loc` step with the next operation,
+`sort_values`. Since the output of `loc` is a data frame, we can use the `sort_values` method on it
 without first giving it a name! That is what the `.sort_values` does on the next line.
 Finally, we once again "chain" together the output of `sort_values` with `head` to ask for the 10
-most common languages. Finally, the right parenthesis `)` corresponding to the very first left parenthesis
+most common languages. Finally, the right parenthesis `)` corresponding to the very first left parenthesis
 appears on the second last line, completing the multiline expression.
 Instead of creating intermediate objects, with chaining, we take the output of
 one operation and use that to perform the next operation. In doing so, we remove the need to create and
@@ -811,19 +814,22 @@ the `x` (represents the x-axis position of the points) and
 function to handle this: we specify that the `language` column should correspond to the x-axis,
 and that the `mother_tongue` column should correspond to the y-axis.
 
-**(FIGURE 1.6 FROM R BOOK IS MISSING)**
+```{figure} img/altair_syntax.png
+---
+height: 220px
+name: img-altair
+---
+Syntax for using `altair` to make a bar chart.
+```
 
 +++
 
 ```{code-cell} ipython3
 :tags: []
 
 barplot_mother_tongue = (
-    alt.Chart(ten_lang)
-    .mark_bar().encode(
-    x="language",
-    y="mother_tongue"
-    ))
+    alt.Chart(ten_lang).mark_bar().encode(x="language", y="mother_tongue")
+)
 
 
 ```
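
The parenthesized style used for `ten_lang` and `barplot_mother_tongue` in this diff relies on Python joining lines implicitly inside `()`, `[]`, and `{}`. The sketch below shows the same idea using only built-in string methods, so it runs without pandas or altair; the variable names and the sample text are made up for illustration.

```python
# Made-up sample text; any string works.
raw_title = "  ten most common ABORIGINAL languages  "

# One expression written on a single line...
one_line = raw_title.strip().lower().replace(" ", "_")

# ...and the same expression split across lines. The opening parenthesis
# tells Python to keep reading until the matching closing parenthesis,
# so each chained method call can sit on its own line.
multi_line = (
    raw_title
    .strip()
    .lower()
    .replace(" ", "_")
)

print(multi_line)
```

Both forms evaluate to the same value; the parentheses only change where line breaks are allowed, not what the expression computes.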
