@@ -224,7 +224,7 @@ affiliations, cities' populations, etc.
224
224
225
225
``` {figure} img/spreadsheet_vs_df.png
226
226
---
227
- height: 400px
227
+ height: 500px
228
228
name: img-spreadsheet-vs-data frame
229
229
---
230
230
A spreadsheet versus a data frame in Python
@@ -311,11 +311,9 @@ file satisfies everything else that the `read_csv` function expects in the defau
311
311
use-case. {numref}` img-read-csv ` describes how we use the ` read_csv `
312
312
to read data into Python.
313
313
314
- ** (FIGURE 1.2 FROM R BOOK IS NOT MISSING, BUT STILL R VERSION. NEEDS PD.READ_CSV)**
315
-
316
- ``` {figure} img/read_csv_function.jpeg
314
+ ``` {figure} img/read_csv_function.png
317
315
---
318
- height: 200px
316
+ height: 220px
319
317
name: img-read-csv
320
318
---
321
319
Syntax for the `read_csv` function
@@ -324,6 +322,7 @@ Syntax for the `read_csv` function
324
322
325
323
+++
326
324
``` {code-cell} ipython3
325
+ :tags: ["output_scroll"]
327
326
pd.read_csv("data/can_lang.csv")
328
327
329
328
```
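For readers without `data/can_lang.csv` at hand, here is a minimal sketch of the same `read_csv` call on an in-memory stand-in. The CSV text, the name `can_lang_demo`, and all values are assumptions made up for illustration; only the column names are taken from the code in this diff.

```python
import io

import pandas as pd

# Hypothetical stand-in for data/can_lang.csv. The column names match the
# ones used later in this diff; the rows and counts are invented.
csv_text = """category,language,mother_tongue
Aboriginal languages,Language A,100
Official languages,Language B,2500
Aboriginal languages,Language C,400
"""

# read_csv accepts a file path or any file-like object, so a StringIO
# buffer can play the role of the file on disk.
can_lang_demo = pd.read_csv(io.StringIO(csv_text))
print(can_lang_demo)
```

The first line of the text is treated as the header, which is the default behaviour the surrounding prose describes.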
@@ -426,6 +425,7 @@ variables (i.e., columns) are printed just underneath the data frame (214 rows a
426
425
Printing a few rows from a data frame like this is a handy way to get a quick sense for what is contained in it.
427
426
428
427
``` {code-cell} ipython3
428
+ :tags: ["output_scroll"]
429
429
can_lang
430
430
```
431
431
@@ -486,11 +486,9 @@ or one of the names we have given to objects in the code we have already written
486
486
> of ` "Aboriginal languages" ` above, or ` 'category' ` instead of ` "category" ` .
487
487
> Try both out for yourself!
488
488
489
- ** (This figure is wrong-- should be for [ ] operation below)**
490
-
491
- ``` {figure} img/read_csv_function.jpeg
489
+ ``` {figure} img/filter_rows.png
492
490
---
493
- height: 200px
491
+ height: 220px
494
492
name: img-filter
495
493
---
496
494
Syntax for using the `[]` operation to filter rows.
@@ -500,6 +498,7 @@ This operation returns a data frame that has all the columns of the input data f
500
498
but only those rows corresponding to Aboriginal languages that we asked for in the logical statement.
501
499
502
500
``` {code-cell} ipython3
501
+ :tags: ["output_scroll"]
503
502
can_lang[can_lang["category"] == "Aboriginal languages"]
504
503
```
505
504
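The row filter in this hunk can be tried on a small made-up data frame. This is only a sketch: the frame `demo`, its rows, and the name `is_aboriginal` are hypothetical, and only the column names come from the diff.

```python
import pandas as pd

# Hypothetical miniature of can_lang; the values are invented.
demo = pd.DataFrame(
    {
        "category": [
            "Aboriginal languages",
            "Official languages",
            "Aboriginal languages",
        ],
        "language": ["Language A", "Language B", "Language C"],
        "mother_tongue": [100, 2500, 400],
    }
)

# The comparison yields one True/False value per row ...
is_aboriginal = demo["category"] == "Aboriginal languages"

# ... and passing that Boolean series to [] keeps only the True rows,
# with every column of the input retained.
filtered = demo[is_aboriginal]
print(filtered)
```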
@@ -519,11 +518,9 @@ selecting only the `language` and `mother_tongue` columns from our original
519
518
` can_lang ` data frame, we put the list ` ["language", "mother_tongue"] `
520
519
containing those two column names inside the square brackets of the ` [] ` operation.
521
520
522
- ** (This figure is wrong-- should be for [ ] operation below)**
523
-
524
- ``` {figure} img/read_csv_function.jpeg
521
+ ``` {figure} img/select_columns.png
525
522
---
526
- height: 200px
523
+ height: 220px
527
524
name: img-select
528
525
---
529
526
Syntax for using the `[]` operation to select columns.
@@ -553,18 +550,18 @@ that with the `.loc[]` method. Inside the square brackets,
553
550
we write our row filtering logical statement,
554
551
then a comma, then our list of columns to select.
555
552
556
- ** (This figure is wrong-- should be for .loc[ ] operation below)**
557
-
558
- ``` {figure} img/read_csv_function.jpeg
553
+ ``` {figure} img/filter_rows_and_columns.png
559
554
---
560
- height: 200px
555
+ height: 220px
561
556
name: img-loc
562
557
---
563
558
Syntax for using the `loc[]` operation to filter rows and select columns.
564
559
```
565
560
566
561
``` {code-cell} ipython3
567
- aboriginal_lang = can_lang.loc[can_lang["category"] == "Aboriginal languages", ["language", "mother_tongue"]]
562
+ aboriginal_lang = can_lang.loc[
563
+     can_lang["category"] == "Aboriginal languages", ["language", "mother_tongue"]
564
+ ]
568
565
```
569
566
There is one very important thing to notice in this code example.
570
567
The first is that we used the ` loc[] ` operation on the ` can_lang ` data frame by
@@ -610,7 +607,13 @@ language, we will use the `sort_values` function to order the rows in our
610
607
arrange the rows in descending order (from largest to smallest),
611
608
so we specify the argument ` ascending ` as ` False ` .
612
609
613
- ** (FIGURE 1.5 FROM R BOOK MISSING HERE)**
610
+ ``` {figure} img/sort_values.png
611
+ ---
612
+ height: 220px
613
+ name: img-sort-values
614
+ ---
615
+ Syntax for using `sort_values` to arrange rows in descending order.
616
+ ```
614
617
615
618
``` {code-cell} ipython3
616
619
arranged_lang = aboriginal_lang.sort_values(by='mother_tongue', ascending=False)
@@ -636,8 +639,8 @@ ten_lang
636
639
It took us 3 steps to find the ten Aboriginal languages most often reported in
637
640
2016 as mother tongues in Canada. Starting from the ` can_lang ` data frame, we:
638
641
639
- 1 ) used ` loc ` to filter the rows so that only the
640
- ` Aboriginal languages ` category remained, and selected the
642
+ 1 ) used ` loc ` to filter the rows so that only the
643
+ ` Aboriginal languages ` category remained, and selected the
641
644
` language ` and ` mother_tongue ` columns,
642
645
2 ) used ` sort_values ` to sort the rows by ` mother_tongue ` in descending order, and
643
646
3 ) obtained only the top 10 values using ` head ` .
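The three steps listed above can be sketched on a small made-up data frame. The frame `demo`, its values, and the intermediate names are assumptions for illustration, and `head(2)` is used instead of `head(10)` only because the toy data is so short.

```python
import pandas as pd

# Hypothetical miniature of can_lang; the values are invented.
demo = pd.DataFrame(
    {
        "category": [
            "Aboriginal languages",
            "Official languages",
            "Aboriginal languages",
            "Aboriginal languages",
        ],
        "language": ["Language A", "Language B", "Language C", "Language D"],
        "mother_tongue": [100, 2500, 400, 50],
    }
)

# 1) filter the rows and select two columns with loc
aboriginal = demo.loc[
    demo["category"] == "Aboriginal languages", ["language", "mother_tongue"]
]

# 2) sort by mother_tongue, largest first
arranged = aboriginal.sort_values(by="mother_tongue", ascending=False)

# 3) keep only the top rows
top_two = arranged.head(2)
print(top_two)
```

Each step stores its result under a new name, which is exactly the pattern the following paragraphs of the chapter set out to simplify.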
@@ -659,30 +662,30 @@ It is hard to keep track of what methods are being called, and what arguments we
659
662
Second, each line introduces a new temporary object. In this case, both ` aboriginal_lang ` and ` arranged_lang_sorted `
660
663
are just temporary results on the way to producing the ` ten_lang ` data frame.
661
664
This makes the code hard to read, as one has to trace where each temporary object
662
- goes, and hard to understand, since introducing many named objects also suggests that they
665
+ goes, and hard to understand, since introducing many named objects also suggests that they
663
666
are of some importance, when really they are just intermediates.
664
667
The need to call multiple methods in a sequence to process a data frame is
665
668
quite common, so this is an important issue to address!
666
669
667
670
To solve the first problem, we can actually split the long expressions above across
668
671
multiple lines. Although in most cases, a single expression in Python must be contained
669
- in a single line of code, there are a small number of situations where Python lets us do this.
672
+ in a single line of code, there are a small number of situations where Python lets us do this.
670
673
Let's rewrite this code in a more readable format using multiline expressions.
671
674
672
675
``` {code-cell} ipython3
673
676
aboriginal_lang = can_lang.loc[
674
- can_lang["category"] == "Aboriginal languages",
675
- ["language", "mother_tongue"] ]
677
+     can_lang["category"] == "Aboriginal languages", ["language", "mother_tongue"]
678
+ ]
676
679
arranged_lang_sorted = aboriginal_lang.sort_values(
677
- by='mother_tongue',
678
- ascending=False )
680
+     by='mother_tongue', ascending=False
681
+ )
679
682
ten_lang = arranged_lang_sorted.head(10)
680
683
```
681
684
682
685
This code is the same as the code we showed earlier; you can see the same
683
686
sequence of methods and arguments is used. But long expressions are split
684
687
across multiple lines when they would otherwise get long and unwieldy,
685
- improving the readability of the code.
688
+ improving the readability of the code.
686
689
How does Python know when to keep
687
690
reading on the next line for a single expression?
688
691
For the line starting with ` aboriginal_lang = ... ` , Python sees that the line ends with a left
@@ -692,7 +695,7 @@ We put the same two arguments as we did before, and then
692
695
the corresponding right bracket appears after ` ["language", "mother_tongue"] ` ).
693
696
For the line starting with ` arranged_lang_sorted = ... ` , Python sees that the line ends with a left parenthesis symbol ` ( ` ,
694
697
and knows the expression cannot end until we close it with the corresponding right parenthesis symbol ` ) ` .
695
- Again we use the same two arguments as before, and then the
698
+ Again we use the same two arguments as before, and then the
696
699
corresponding right parenthesis appears right after ` ascending=False ` .
697
700
In both cases, Python keeps reading the next line to figure out
698
701
what the rest of the expression is. We could, of course,
@@ -701,7 +704,7 @@ multiple lines helps a lot with code readability.
701
704
702
705
We still have to handle the issue that each line of code---i.e., each step in the analysis---introduces
703
706
a new temporary object. To address this issue, we can * chain* multiple operations together without
704
- assigning intermediate objects. The key idea of chaining is that the * output* of
707
+ assigning intermediate objects. The key idea of chaining is that the * output* of
705
708
each step in the analysis is a data frame, which means that you can just directly keep calling methods
706
709
that operate on the output of each step in a sequence! This simplifies the code and makes it
707
710
easier to read. The code below demonstrates the use of both multiline expressions and chaining together.
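As a quick check of the chaining idea described in this hunk, the sketch below runs a step-by-step version and a chained version on the same made-up data frame and compares the results. The frame `demo`, its values, and all variable names are hypothetical.

```python
import pandas as pd

# Hypothetical miniature of can_lang; the values are invented.
demo = pd.DataFrame(
    {
        "category": [
            "Aboriginal languages",
            "Official languages",
            "Aboriginal languages",
            "Aboriginal languages",
        ],
        "language": ["Language A", "Language B", "Language C", "Language D"],
        "mother_tongue": [100, 2500, 400, 50],
    }
)

# Step-by-step version: every intermediate result gets its own name.
subset = demo.loc[
    demo["category"] == "Aboriginal languages", ["language", "mother_tongue"]
]
ordered = subset.sort_values(by="mother_tongue", ascending=False)
stepwise = ordered.head(2)

# Chained version: each method is called directly on the previous output.
# The outer parentheses let the expression span several lines.
chained = (
    demo.loc[
        demo["category"] == "Aboriginal languages",
        ["language", "mother_tongue"],
    ]
    .sort_values(by="mother_tongue", ascending=False)
    .head(2)
)

# Both approaches produce the same data frame.
print(chained.equals(stepwise))
```

The chained form leaves only one named object behind, which is the readability gain the text argues for.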
@@ -712,7 +715,7 @@ from the messy code above!
712
715
# obtain the 10 most common Aboriginal languages
713
716
ten_lang = (
714
717
    can_lang.loc[
715
-         can_lang["category"] == "Aboriginal languages",
718
+         can_lang["category"] == "Aboriginal languages",
716
719
        ["language", "mother_tongue"]
717
720
    ]
718
721
    .sort_values(by="mother_tongue", ascending=False)
@@ -721,15 +724,15 @@ ten_lang = (
721
724
ten_lang
722
725
```
723
726
724
- Let's parse this new block of code piece by piece.
727
+ Let's parse this new block of code piece by piece.
725
728
The code above starts with a left parenthesis, ` ( ` , and so Python
726
729
knows to keep reading to subsequent lines until it finds the corresponding
727
730
right parenthesis symbol ` ) ` . The ` loc ` method performs the filtering and selecting steps as before. The line after this
728
- starts with a period (` . ` ) that "chains" the output of the ` loc ` step with the next operation,
729
- ` sort_values ` . Since the output of ` loc ` is a data frame, we can use the ` sort_values ` method on it
731
+ starts with a period (` . ` ) that "chains" the output of the ` loc ` step with the next operation,
732
+ ` sort_values ` . Since the output of ` loc ` is a data frame, we can use the ` sort_values ` method on it
730
733
without first giving it a name! That is what the ` .sort_values ` does on the next line.
731
734
Next, we once again "chain" together the output of ` sort_values ` with ` head ` to ask for the 10
732
- most common languages. Finally, the right parenthesis ` ) ` corresponding to the very first left parenthesis
735
+ most common languages. Finally, the right parenthesis ` ) ` corresponding to the very first left parenthesis
733
736
appears on the second-to-last line, completing the multiline expression.
734
737
Instead of creating intermediate objects, with chaining, we take the output of
735
738
one operation and use that to perform the next operation. In doing so, we remove the need to create and
@@ -811,19 +814,22 @@ the `x` (represents the x-axis position of the points) and
811
814
function to handle this: we specify that the ` language ` column should correspond to the x-axis,
812
815
and that the ` mother_tongue ` column should correspond to the y-axis.
813
816
814
- ** (FIGURE 1.6 FROM R BOOK IS MISSING)**
817
+ ``` {figure} img/altair_syntax.png
818
+ ---
819
+ height: 220px
820
+ name: img-altair
821
+ ---
822
+ Syntax for using `altair` to make a bar chart.
823
+ ```
815
824
816
825
+++
817
826
818
827
``` {code-cell} ipython3
819
828
:tags: []
820
829
821
830
barplot_mother_tongue = (
822
- alt.Chart(ten_lang)
823
- .mark_bar().encode(
824
- x="language",
825
- y="mother_tongue"
826
- ))
831
+     alt.Chart(ten_lang).mark_bar().encode(x="language", y="mother_tongue")
832
+ )
827
833
828
834
829
835
```