@@ -60,8 +60,8 @@ By the end of the chapter, readers will be able to do the following:
- `and`
- `or`
- `[]`
- - `.iloc[]`
- `.loc[]`
+ - `.iloc[]`

## Data frames, series, and lists
@@ -881,25 +881,6 @@ pd.Series(["Vancouver", "Toronto"]) == pd.Series(["Toronto", "Vancouver"])
pd.Series(["Vancouver", "Toronto"]).isin(pd.Series(["Toronto", "Vancouver"]))
```
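The two lines above differ in an important way: `==` compares the series element by element (position against position), while `.isin` checks whether each element of the first series appears anywhere in the second. A small self-contained sketch (the series below are toy examples, not the chapter's data):

```python
import pandas as pd

cities_a = pd.Series(["Vancouver", "Toronto"])
cities_b = pd.Series(["Toronto", "Vancouver"])

# Element-by-element: position 0 vs position 0, position 1 vs position 1.
elementwise = cities_a == cities_b

# Membership: is each element of cities_a anywhere in cities_b?
membership = cities_a.isin(cities_b)

print(elementwise.tolist())  # [False, False]
print(membership.tolist())   # [True, True]
```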
- ``` {code-cell} ipython3
- :tags: [remove-cell]
-
- # > **Note:** What's the difference between `==` and `%in%`? Suppose we have two
- # > vectors, `vectorA` and `vectorB`. If you type `vectorA == vectorB` into R it
- # > will compare the vectors element by element. R checks if the first element of
- # > `vectorA` equals the first element of `vectorB`, the second element of
- # > `vectorA` equals the second element of `vectorB`, and so on. On the other hand,
- # > `vectorA %in% vectorB` compares the first element of `vectorA` to all the
- # > elements in `vectorB`. Then the second element of `vectorA` is compared
- # > to all the elements in `vectorB`, and so on. Notice the difference between `==` and
- # > `%in%` in the example below.
- # >
- # >``` {r}
- # >c("Vancouver", "Toronto") == c("Toronto", "Vancouver")
- # >c("Vancouver", "Toronto") %in% c("Toronto", "Vancouver")
- # >```
- ```
-

### Extracting rows above or below a threshold using `>` and `<`

``` {code-cell} ipython3
@@ -928,6 +909,19 @@ only English in Toronto is reported by more people
as their primary language at home
than French in Montréal according to the 2016 Canadian census.

+ ### Extracting rows using `.query()`
+
+ You can also extract rows above, below, equal to, or not equal to a threshold using the
+ `.query()` method. For example, the following gives us the same result as when we used
+ `official_langs[official_langs["most_at_home"] > 2669195]`.
+
+ ``` {code-cell} ipython3
+ official_langs.query("most_at_home > 2669195")
+ ```
+
+ The query (the criteria we are using to select values) is input as a string. This will
+ come in handy when we talk about chaining later in the chapter.
+
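One convenience of `.query()` worth noting: because the criteria are written as a single string, multiple conditions can be combined with `and` and `or` inside it. A sketch on a small hypothetical data frame (the values below are illustrative, not the chapter's census data):

```python
import pandas as pd

# A small illustrative frame (hypothetical values).
langs = pd.DataFrame({
    "language": ["English", "French", "Mandarin"],
    "most_at_home": [3836770, 620510, 176950],
    "most_at_work": [3218725, 223105, 22265],
})

# Conditions in the query string can be combined with `and` / `or`.
large = langs.query("most_at_home > 500000 and most_at_work > 300000")
print(large["language"].tolist())  # ['English']
```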
(loc-iloc)=
## Using `.loc[]` to filter rows and select columns

``` {index} pandas.DataFrame; loc[]
@@ -1285,17 +1279,6 @@ multiple lines of code, storing temporary objects as you go:

``` {code-cell} ipython3
:tags: [remove-cell]

- # ## Combining functions using the pipe operator, `|>`
-
- # In R, we often have to call multiple functions in a sequence to process a data
- # frame. The basic ways of doing this can become quickly unreadable if there are
- # many steps. For example, suppose we need to perform three operations on a data
- # frame called `data`: \index{pipe}\index{aaapipesymb@\vert{}>|see{pipe}}
- ```
-
- ``` {code-cell} ipython3
- :tags: [remove-cell]
-
data = pd.DataFrame({"old_col": [1, 2, 5, 0], "other_col": [1, 10, 3, 6]})
```
@@ -1330,28 +1313,6 @@ output = (
)
```

- ``` {code-cell} ipython3
- :tags: [remove-cell]
-
- # ``` {r eval = F}
- # output <- select(filter(mutate(data, new_col = old_col * 2),
- #                         other_col > 5),
- #                  new_col)
- # ```
- # Code like this can also be difficult to understand. Functions compose (reading
- # from left to right) in the *opposite order* in which they are computed by R
- # (above, `mutate` happens first, then `filter`, then `select`). It is also just a
- # really long line of code to read in one go.
-
- # The *pipe operator* (`|>`) solves this problem, resulting in cleaner and
- # easier-to-follow code. `|>` is built into R so you don't need to load any
- # packages to use it.
- # You can think of the pipe as a physical pipe. It takes the output from the
- # function on the left-hand side of the pipe, and passes it as the first argument
- # to the function on the right-hand side of the pipe.
- # The code below accomplishes the same thing as the previous
- # two code blocks:
- ```

> **Note:** You might also have noticed that we split the function calls across
> lines, similar to when we did this earlier in the chapter
@@ -1360,35 +1321,7 @@ output = (
> your code more readable. When you do this, it is important to use parentheses
> to tell Python that your code is continuing onto the next line.

- ``` {code-cell} ipython3
- :tags: [remove-cell]
-
- # > **Note:** You might also have noticed that we split the function calls across
- # > lines after the pipe, similar to when we did this earlier in the chapter
- # > for long function calls. Again, this is allowed and recommended, especially when
- # > the piped function calls create a long line of code. Doing this makes
- # > your code more readable. When you do this, it is important to end each line
- # > with the pipe operator `|>` to tell R that your code is continuing onto the
- # > next line.
-
- # > **Note:** In this textbook, we will be using the base R pipe operator syntax, `|>`.
- # > This base R `|>` pipe operator was inspired by a previous version of the pipe
- # > operator, `%>%`. The `%>%` pipe operator is not built into R
- # > and is from the `magrittr` R package.
- # > The `tidyverse` metapackage imports the `%>%` pipe operator via `dplyr`
- # > (which in turn imports the `magrittr` R package).
- # > There are some other differences between `%>%` and `|>` related to
- # > more advanced R uses, such as sharing and distributing code as R packages,
- # > however, these are beyond the scope of this textbook.
- # > We have this note in the book to make the reader aware that `%>%` exists
- # > as it is still commonly used in data analysis code and in many data science
- # > books and other resources.
- # > In most cases these two pipes are interchangeable and either can be used.
-
- # \index{pipe}\index{aaapipesymbb@\%>\%|see{pipe}}
- ```
-
- ### Chaining `[]` and `.loc`
+ ### Chaining with `.loc`

+++
@@ -1420,37 +1353,27 @@ van_data_selected

Although this is valid code, there is a more readable approach we could take by
chaining the operations. With chaining, we do not need to create an intermediate
- object to store the output from `[]`. Instead, we can directly call `.loc` upon the
- output of `[]`:
+ object to store the output from `[]`. Instead, we can directly call `.loc` to select
+ the rows and columns we are interested in:

``` {code-cell} ipython3
- van_data_selected = tidy_lang[tidy_lang["region"] == "Vancouver"].loc[
-     :, ["language", "most_at_home"]
+ van_data_selected = tidy_lang.loc[
+     tidy_lang["region"] == "Vancouver", ["language", "most_at_home"]
]
-
van_data_selected
```

- ``` {code-cell} ipython3
- :tags: [remove-cell]
-
- # But wait...Why do the `select` and `filter` function calls
- # look different in these two examples?
- # Remember: when you use the pipe,
- # the output of the first function is automatically provided
- # as the first argument for the function that comes after it.
- # Therefore you do not specify the first argument in that function call.
- # In the code above,
- # the first line is just the `tidy_lang` data frame with a pipe.
- # The pipe passes the left-hand side (`tidy_lang`) to the first argument of the function on the right (`filter`),
- # so in the `filter` function you only see the second argument (and beyond).
- # Then again after `filter` there is a pipe, which passes the result of the `filter` step
- # to the first argument of the `select` function.
- ```
-

As you can see, both of these approaches&mdash;with and without chaining&mdash;give us the same output, but the second
approach is clearer and more readable.

+ <!-- Note that the following, which uses `[]` and `.loc[]`, is valid but discouraged as it is more difficult to follow:
+ ```{code-cell} ipython3
+ van_data_selected = tidy_lang[tidy_lang["region"] == "Vancouver"].loc[
+     :, ["language", "most_at_home"]
+ ]
+ van_data_selected
+ ``` -->
+

+++

### Chaining more than two functions
@@ -1459,12 +1382,12 @@ approach is clearer and more readable.

Chaining can be used with any method in Python.
Additionally, we can chain together more than two functions.
- For example, we can chain together three functions to:
+ For example, we can chain together functions to:

- - extract rows (`[]`) to include only those where the counts of the language most spoken at home are greater than 10,000,
- - extract only the columns (`.loc`) corresponding to `region`, `language` and `most_at_home`, and
+ - extract rows (`.loc`) to include only those where the counts of the language most spoken at home are greater than 10,000,
+ - also extract only the columns (`.loc`) corresponding to `region`, `language` and `most_at_home`, and
- sort the data frame rows in order (`.sort_values`) by counts of the language most spoken at home
- from smallest to largest.
+ from smallest to largest. The first two steps can be accomplished in one use of `.loc`.

``` {index} pandas.DataFrame; sort_values
```
@@ -1476,31 +1399,15 @@ Here we pass the column name `most_at_home` to sort the data frame rows by the v

``` {code-cell} ipython3
large_region_lang = (
-     tidy_lang[tidy_lang["most_at_home"] > 10000]
-     .loc[:, ["region", "language", "most_at_home"]]
-     .sort_values(by="most_at_home")
+     tidy_lang.loc[
+         tidy_lang["most_at_home"] > 10000,
+         ["region", "language", "most_at_home"]
+     ]
+     .sort_values("most_at_home")
)
-
large_region_lang
```

- ``` {code-cell} ipython3
- :tags: [remove-cell]
-
- # You will notice above that we passed `tidy_lang` as the first argument of the `filter` function.
- # We can also pipe the data frame into the same sequence of functions rather than
- # using it as the first argument of the first function. These two choices are equivalent,
- # and we get the same result.
- # ``` {r}
- # large_region_lang <- tidy_lang |>
- #   filter(most_at_home > 10000) |>
- #   select(region, language, most_at_home) |>
- #   arrange(most_at_home)
-
- # large_region_lang
- # ```
- ```
-

Now that we've shown you chaining as an alternative to storing
temporary objects and composing code, does this mean you should *never* store
temporary objects or compose code? Not necessarily!
@@ -1731,8 +1638,8 @@ can also be used. To do this, pass a list of column names to the `by` argument.
``` {code-cell} ipython3
region_summary = pd.DataFrame()
region_summary = region_summary.assign(
-     min_most_at_home=region_lang.groupby(by="region")["most_at_home"].min(),
-     max_most_at_home=region_lang.groupby(by="region")["most_at_home"].max()
+     min_most_at_home=region_lang.groupby("region")["most_at_home"].min(),
+     max_most_at_home=region_lang.groupby("region")["most_at_home"].max()
).reset_index()

region_summary.columns = ["region", "min_most_at_home", "max_most_at_home"]
@@ -1743,7 +1650,7 @@ region_summary

``` {code-cell} ipython3
region_summary = (
-     region_lang.groupby(by="region")["most_at_home"].agg(["min", "max"]).reset_index()
+     region_lang.groupby("region")["most_at_home"].agg(["min", "max"]).reset_index()
)
region_summary.columns = ["region", "min_most_at_home", "max_most_at_home"]
region_summary
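An alternative worth knowing: pandas also supports named aggregation, which labels the output columns directly and avoids the separate renaming step via `region_summary.columns = [...]`. A sketch on a toy frame standing in for the chapter's data (the columns and values below are illustrative):

```python
import pandas as pd

# Toy frame standing in for region_lang (illustrative values).
toy = pd.DataFrame({
    "region": ["Toronto", "Toronto", "Montréal"],
    "most_at_home": [50, 100, 75],
})

# Named aggregation: output columns are labeled directly,
# so no separate renaming step is needed.
summary = (
    toy.groupby("region")
    .agg(
        min_most_at_home=("most_at_home", "min"),
        max_most_at_home=("most_at_home", "max"),
    )
    .reset_index()
)
print(list(summary.columns))  # ['region', 'min_most_at_home', 'max_most_at_home']
```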
@@ -1818,47 +1725,6 @@ summary methods (*e.g.* `.min`, `.max`, `.sum` etc.) can be used for data frames
pd.DataFrame(region_lang.iloc[:, 3:].max(axis=0)).T
```

- ``` {code-cell} ipython3
- ---
- jupyter:
-   source_hidden: true
- tags: [remove-cell]
- ---
- # To summarize statistics across many columns, we can use the
- # `summarize` function we have just recently learned about.
- # However, in such a case, using `summarize` alone means that we have to
- # type out the name of each column we want to summarize.
- # To do this more efficiently, we can pair `summarize` with `across` \index{across}
- # and use a colon `:` to specify a range of columns we would like \index{column range}
- # to perform the statistical summaries on.
- # Here we demonstrate finding the maximum value
- # of each of the numeric
- # columns of the `region_lang` data set.
-
- # ``` {r 02-across-data}
- # region_lang |>
- #   summarize(across(mother_tongue:lang_known, max))
- # ```
-
- # > **Note:** Similar to when we use base R statistical summary functions
- # > (e.g., `max`, `min`, `mean`, `sum`, etc) with `summarize` alone,
- # > the use of the `summarize` + `across` functions paired
- # > with base R statistical summary functions
- # > also return `NA`s when we apply them to columns that
- # > contain `NA`s in the data frame. \index{missing data}
- # >
- # > To avoid this, again we need to add the argument `na.rm = TRUE`,
- # > but in this case we need to use it a little bit differently.
- # > In this case, we need to add a `,` and then `na.rm = TRUE`,
- # > after specifying the function we want `summarize` + `across` to apply,
- # > as illustrated below:
- # >
- # > ``` {r}
- # > region_lang_na |>
- # >   summarize(across(mother_tongue:lang_known, max, na.rm = TRUE))
- # > ```
- ```
-

(apply-summary)=
#### `.apply` for calculating summary statistics on many columns
@@ -1875,7 +1741,7 @@ We focus on the two arguments of `.apply`:
the function that you would like to apply to each column, and the `axis` along which the function will be applied (`0` for columns, `1` for rows).
Note that `.apply` does not have an argument
to specify *which* columns to apply the function to.
- Therefore, we will use `.iloc[]` before calling `.apply`
+ Therefore, we will use `[]` before calling `.apply`
to choose the columns for which we want the maximum.

``` {code-cell} ipython3
@@ -1898,7 +1764,7 @@ tags: [remove-cell]
```

``` {code-cell} ipython3
- pd.DataFrame(region_lang.iloc[:, 3:].apply(max, axis=0)).T
+ pd.DataFrame(region_lang[["most_at_home", "most_at_work"]].apply(max, axis=0)).T
```

``` {index} missing data
@@ -1917,7 +1783,7 @@ pd.DataFrame(region_lang.iloc[:, 3:].apply(max, axis=0)).T

``` {code-cell} ipython3
pd.DataFrame(
-     region_lang_na.iloc[:, 3:].apply(lambda col: col.max(skipna=True), axis=0)
+     region_lang_na[["most_at_home", "most_at_work"]].apply(lambda col: col.max(skipna=True), axis=0)
).T
```
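Note that pandas reductions skip missing values by default (`skipna=True`), so the explicit lambda above is one of several equivalent spellings. A small sketch with a toy column containing `NaN` (illustrative values):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"most_at_home": [100.0, np.nan, 75.0]})

# Reductions skip NaN by default (skipna=True), so plain .max()
# behaves like the explicit lambda above.
print(toy["most_at_home"].max())  # 100.0
```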
@@ -2048,7 +1914,7 @@ To accomplish such a task, we can use `.apply`.
This works in a similar way for column selection,
as we saw in Section {ref}`apply-summary` earlier.
As we did above,
- we again use `.iloc` to specify the columns
+ we again use `[]` to specify the columns
as well as `.apply` to specify the function we want to apply to these columns.
However, a key difference here is that we are not using an aggregating function,
which means that we get back a data frame with the same number of rows.
@@ -2074,8 +1940,8 @@ region_lang.info()
```

``` {code-cell} ipython3
- region_lang_int32 = region_lang.iloc[:, 3:].apply(lambda col: col.astype('int32'), axis=0)
- region_lang_int32 = pd.concat((region_lang.iloc[:, :3], region_lang_int32), axis=1)
+ region_lang_int32 = region_lang[["most_at_home", "most_at_work"]].apply(lambda col: col.astype('int32'), axis=0)
+ region_lang_int32 = pd.concat((region_lang.drop(columns=["most_at_home", "most_at_work"]), region_lang_int32), axis=1)
region_lang_int32
```
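As an aside, `.astype` also accepts a dictionary mapping column names to types, which can replace the `.apply`/`pd.concat` round trip with a single call. A sketch on a toy frame (the columns below are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "region": ["Toronto", "Montréal"],
    "most_at_home": [100, 75],
    "most_at_work": [80, 60],
})

# A dict converts only the listed columns, leaving the rest untouched.
toy_int32 = toy.astype({"most_at_home": "int32", "most_at_work": "int32"})
print(toy_int32.dtypes)
```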
@@ -2111,7 +1977,7 @@ For instance, suppose we want to know the maximum value between `mother_tongue`,
and `lang_known` for each language and region
in the `region_lang` data set.
In other words, we want to apply the `max` function *row-wise*.
- Before we use `.apply`, we will again use `.iloc` to select only the count columns
+ Before we use `.apply`, we will again use `[]` to select only the count columns
so we can see all the columns in the data frame's output easily in the book.
So for this demonstration, the data set we are operating on looks like this:
@@ -2135,7 +2001,7 @@ tags: [remove-cell]
```

``` {code-cell} ipython3
- region_lang.iloc[:, 3:]
+ region_lang[["most_at_home", "most_at_work"]]
```

Now we use `.apply` with argument `axis=1` to tell Python that we would like
@@ -2157,7 +2023,7 @@ tags: [remove-cell]

``` {code-cell} ipython3
region_lang_rowwise = region_lang.assign(
-     maximum=region_lang.iloc[:, 3:].apply(max, axis=1)
+     maximum=region_lang[["most_at_home", "most_at_work"]].apply(max, axis=1)
)

region_lang_rowwise
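For simple reductions like the maximum, the built-in `.max(axis=1)` produces the same row-wise result as `.apply(max, axis=1)`. A sketch on toy data (illustrative values):

```python
import pandas as pd

toy = pd.DataFrame({
    "most_at_home": [100, 75],
    "most_at_work": [80, 90],
})

via_apply = toy.apply(max, axis=1)  # row-wise max via .apply
via_max = toy.max(axis=1)           # row-wise max via the built-in reduction

print(via_apply.tolist())  # [100, 90]
print(via_max.tolist())    # [100, 90]
```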