Commit 5fc3871

working on wrangling chapter

1 parent 51a6a7f commit 5fc3871

File tree

1 file changed: +48 −182 lines changed

source/wrangling.md (+48 −182)

@@ -60,8 +60,8 @@ By the end of the chapter, readers will be able to do the following:
 - `and`
 - `or`
 - `[]`
-- `.iloc[]`
 - `.loc[]`
+- `.iloc[]`
 
 ## Data frames, series, and lists
 
@@ -881,25 +881,6 @@ pd.Series(["Vancouver", "Toronto"]) == pd.Series(["Toronto", "Vancouver"])
 pd.Series(["Vancouver", "Toronto"]).isin(pd.Series(["Toronto", "Vancouver"]))
 ```
 
-```{code-cell} ipython3
-:tags: [remove-cell]
-
-# > **Note:** What's the difference between `==` and `%in%`? Suppose we have two
-# > vectors, `vectorA` and `vectorB`. If you type `vectorA == vectorB` into R it
-# > will compare the vectors element by element. R checks if the first element of
-# > `vectorA` equals the first element of `vectorB`, the second element of
-# > `vectorA` equals the second element of `vectorB`, and so on. On the other hand,
-# > `vectorA %in% vectorB` compares the first element of `vectorA` to all the
-# > elements in `vectorB`. Then the second element of `vectorA` is compared
-# > to all the elements in `vectorB`, and so on. Notice the difference between `==` and
-# > `%in%` in the example below.
-# >
-# >``` {r}
-# >c("Vancouver", "Toronto") == c("Toronto", "Vancouver")
-# >c("Vancouver", "Toronto") %in% c("Toronto", "Vancouver")
-# >```
-```
-
 ### Extracting rows above or below a threshold using `>` and `<`
 
 ```{code-cell} ipython3
@@ -928,6 +909,19 @@ only English in Toronto is reported by more people
 as their primary language at home
 than French in Montréal according to the 2016 Canadian census.
 
+### Extracting rows using `.query()`
+
+You can also extract rows above, below, equal to, or not equal to a threshold using the
+`.query()` method. For example, the following gives us the same result as when we used
+`official_langs[official_langs["most_at_home"] > 2669195]`.
+
+```{code-cell} ipython3
+official_langs.query("most_at_home > 2669195")
+```
+
+The query (the criteria we are using to select rows) is input as a string. This will
+come in handy when we later talk about chaining.
+
 (loc-iloc)=
 ## Using `.loc[]` to filter rows and select columns.
 ```{index} pandas.DataFrame; loc[]
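
The `.query()` addition above can be exercised end to end. A minimal sketch on a toy stand-in for `official_langs` (column names follow the diff; the values are invented, not the census figures):

```python
import pandas as pd

# Toy stand-in for the chapter's `official_langs` data frame; values are invented.
official_langs = pd.DataFrame({
    "language": ["English", "French", "Mandarin"],
    "most_at_home": [3000000, 2669195, 300000],
})

# Boolean-mask selection and `.query()` give the same rows;
# the query criteria are passed as a string.
by_mask = official_langs[official_langs["most_at_home"] > 2669195]
by_query = official_langs.query("most_at_home > 2669195")

print(by_query["language"].tolist())  # ['English']
```

Because the criteria live in a string, `.query()` calls slot neatly into method chains, which is the point the added paragraph makes.
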
@@ -1285,17 +1279,6 @@ multiple lines of code, storing temporary objects as you go:
 ```{code-cell} ipython3
 :tags: [remove-cell]
 
-# ## Combining functions using the pipe operator, `|>`
-
-# In R, we often have to call multiple functions in a sequence to process a data
-# frame. The basic ways of doing this can become quickly unreadable if there are
-# many steps. For example, suppose we need to perform three operations on a data
-# frame called `data`: \index{pipe}\index{aaapipesymb@\vert{}>|see{pipe}}
-```
-
-```{code-cell} ipython3
-:tags: [remove-cell]
-
 data = pd.DataFrame({"old_col": [1, 2, 5, 0], "other_col": [1, 10, 3, 6]})
 ```
 
@@ -1330,28 +1313,6 @@ output = (
 )
 ```
 
-```{code-cell} ipython3
-:tags: [remove-cell]
-
-# ``` {r eval = F}
-# output <- select(filter(mutate(data, new_col = old_col * 2),
-#                         other_col > 5),
-#                  new_col)
-# ```
-# Code like this can also be difficult to understand. Functions compose (reading
-# from left to right) in the *opposite order* in which they are computed by R
-# (above, `mutate` happens first, then `filter`, then `select`). It is also just a
-# really long line of code to read in one go.
-
-# The *pipe operator* (`|>`) solves this problem, resulting in cleaner and
-# easier-to-follow code. `|>` is built into R so you don't need to load any
-# packages to use it.
-# You can think of the pipe as a physical pipe. It takes the output from the
-# function on the left-hand side of the pipe, and passes it as the first argument
-# to the function on the right-hand side of the pipe.
-# The code below accomplishes the same thing as the previous
-# two code blocks:
-```
 
 > **Note:** You might also have noticed that we split the function calls across
 > lines, similar to when we did this earlier in the chapter
@@ -1360,35 +1321,7 @@ output = (
 > your code more readable. When you do this, it is important to use parentheses
 > to tell Python that your code is continuing onto the next line.
 
-```{code-cell} ipython3
-:tags: [remove-cell]
-
-# > **Note:** You might also have noticed that we split the function calls across
-# > lines after the pipe, similar to when we did this earlier in the chapter
-# > for long function calls. Again, this is allowed and recommended, especially when
-# > the piped function calls create a long line of code. Doing this makes
-# > your code more readable. When you do this, it is important to end each line
-# > with the pipe operator `|>` to tell R that your code is continuing onto the
-# > next line.
-
-# > **Note:** In this textbook, we will be using the base R pipe operator syntax, `|>`.
-# > This base R `|>` pipe operator was inspired by a previous version of the pipe
-# > operator, `%>%`. The `%>%` pipe operator is not built into R
-# > and is from the `magrittr` R package.
-# > The `tidyverse` metapackage imports the `%>%` pipe operator via `dplyr`
-# > (which in turn imports the `magrittr` R package).
-# > There are some other differences between `%>%` and `|>` related to
-# > more advanced R uses, such as sharing and distributing code as R packages,
-# > however, these are beyond the scope of this textbook.
-# > We have this note in the book to make the reader aware that `%>%` exists
-# > as it is still commonly used in data analysis code and in many data science
-# > books and other resources.
-# > In most cases these two pipes are interchangeable and either can be used.
-
-# \index{pipe}\index{aaapipesymbb@\%>\%|see{pipe}}
-```
-
-### Chaining `[]` and `.loc`
+### Chaining with `.loc`
 
 +++
 
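The R pipe notes removed above have a Python counterpart that the chapter leans on: method chaining wrapped in parentheses. A small sketch, mirroring the toy `data` frame defined earlier in the diff:

```python
import pandas as pd

data = pd.DataFrame({"old_col": [1, 2, 5, 0], "other_col": [1, 10, 3, 6]})

# Nested style: reads inside-out, hard to follow as steps accumulate.
output_nested = data.assign(new_col=data["old_col"] * 2)[
    data["other_col"] > 5
][["new_col"]]

# Chained style: reads top to bottom, one operation per line.
output_chained = (
    data.assign(new_col=data["old_col"] * 2)
    .query("other_col > 5")
    [["new_col"]]
)

print(output_chained["new_col"].tolist())  # [4, 0]
```

Each method returns a new data frame, so the chain reads in execution order, which is what the pipe bought in R.
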
@@ -1420,37 +1353,27 @@ van_data_selected
 
 Although this is valid code, there is a more readable approach we could take by
 chaining the operations. With chaining, we do not need to create an intermediate
-object to store the output from `[]`. Instead, we can directly call `.loc` upon the
-output of `[]`:
+object to store the output from `[]`. Instead, we can directly call `.loc` to select
+the rows and columns we are interested in:
 
 ```{code-cell} ipython3
-van_data_selected = tidy_lang[tidy_lang["region"] == "Vancouver"].loc[
-    :, ["language", "most_at_home"]
+van_data_selected = tidy_lang.loc[
+    tidy_lang["region"] == "Vancouver", ["language", "most_at_home"]
 ]
-
 van_data_selected
 ```
 
-```{code-cell} ipython3
-:tags: [remove-cell]
-
-# But wait...Why do the `select` and `filter` function calls
-# look different in these two examples?
-# Remember: when you use the pipe,
-# the output of the first function is automatically provided
-# as the first argument for the function that comes after it.
-# Therefore you do not specify the first argument in that function call.
-# In the code above,
-# the first line is just the `tidy_lang` data frame with a pipe.
-# The pipe passes the left-hand side (`tidy_lang`) to the first argument of the function on the right (`filter`),
-# so in the `filter` function you only see the second argument (and beyond).
-# Then again after `filter` there is a pipe, which passes the result of the `filter` step
-# to the first argument of the `select` function.
-```
-
 As you can see, both of these approaches&mdash;with and without chaining&mdash;give us the same output, but the second
 approach is clearer and more readable.
 
+<!-- Note that the following, which uses `[]` and `.loc[]`, is valid but discouraged, as it is more difficult to follow:
+```{code-cell} ipython3
+van_data_selected = tidy_lang[tidy_lang["region"] == "Vancouver"].loc[
+    :, ["language", "most_at_home"]
+]
+van_data_selected
+``` -->
+
 +++
 
 ### Chaining more than two functions
@@ -1459,12 +1382,12 @@ approach is clearer and more readable.
 
 Chaining can be used with any method in Python.
 Additionally, we can chain together more than two functions.
-For example, we can chain together three functions to:
+For example, we can chain together functions to:
 
-- extract rows (`[]`) to include only those where the counts of the language most spoken at home are greater than 10,000,
-- extract only the columns (`.loc`) corresponding to `region`, `language` and `most_at_home`, and
+- extract rows (`.loc`) to include only those where the counts of the language most spoken at home are greater than 10,000,
+- also extract only the columns (`.loc`) corresponding to `region`, `language` and `most_at_home`, and
 - sort the data frame rows in order (`.sort_values`) by counts of the language most spoken at home
-  from smallest to largest.
+  from smallest to largest. The first two steps can be accomplished in one use of `.loc`.
 
 ```{index} pandas.DataFrame; sort_values
 ```
@@ -1476,31 +1399,15 @@ Here we pass the column name `most_at_home` to sort the data frame rows by the v
 
 ```{code-cell} ipython3
 large_region_lang = (
-    tidy_lang[tidy_lang["most_at_home"] > 10000]
-    .loc[:, ["region", "language", "most_at_home"]]
-    .sort_values(by="most_at_home")
+    tidy_lang.loc[
+        tidy_lang["most_at_home"] > 10000,
+        ["region", "language", "most_at_home"]
+    ]
+    .sort_values("most_at_home")
 )
-
 large_region_lang
 ```
 
-```{code-cell} ipython3
-:tags: [remove-cell]
-
-# You will notice above that we passed `tidy_lang` as the first argument of the `filter` function.
-# We can also pipe the data frame into the same sequence of functions rather than
-# using it as the first argument of the first function. These two choices are equivalent,
-# and we get the same result.
-# ``` {r}
-# large_region_lang <- tidy_lang |>
-#     filter(most_at_home > 10000) |>
-#     select(region, language, most_at_home) |>
-#     arrange(most_at_home)
-
-# large_region_lang
-# ```
-
 Now that we've shown you chaining as an alternative to storing
 temporary objects and composing code, does this mean you should *never* store
 temporary objects or compose code? Not necessarily!
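
The chained `.loc` + `.sort_values` pattern above is runnable on toy data (the `tidy_lang` values here are invented, not the book's census counts):

```python
import pandas as pd

# Toy stand-in for `tidy_lang`; values are invented.
tidy_lang = pd.DataFrame({
    "region": ["Toronto", "Toronto", "Vancouver"],
    "language": ["English", "Mandarin", "English"],
    "most_at_home": [3000000, 279765, 1622650],
})

# One `.loc` filters rows and selects columns; `.sort_values` then orders rows.
large_region_lang = (
    tidy_lang.loc[
        tidy_lang["most_at_home"] > 10000,
        ["region", "language", "most_at_home"]
    ]
    .sort_values("most_at_home")
)
print(large_region_lang["most_at_home"].tolist())  # ascending counts
```
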
@@ -1731,8 +1638,8 @@ can also be used. To do this, pass a list of column names to the `by` argument.
 ```{code-cell} ipython3
 region_summary = pd.DataFrame()
 region_summary = region_summary.assign(
-    min_most_at_home=region_lang.groupby(by="region")["most_at_home"].min(),
-    max_most_at_home=region_lang.groupby(by="region")["most_at_home"].max()
+    min_most_at_home=region_lang.groupby("region")["most_at_home"].min(),
+    max_most_at_home=region_lang.groupby("region")["most_at_home"].max()
 ).reset_index()
 
 region_summary.columns = ["region", "min_most_at_home", "max_most_at_home"]
@@ -1743,7 +1650,7 @@ region_summary
 
 ```{code-cell} ipython3
 region_summary = (
-    region_lang.groupby(by="region")["most_at_home"].agg(["min", "max"]).reset_index()
+    region_lang.groupby("region")["most_at_home"].agg(["min", "max"]).reset_index()
 )
 region_summary.columns = ["region", "min_most_at_home", "max_most_at_home"]
 region_summary
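
The `.agg(["min", "max"])` line above, sketched on a toy `region_lang` (invented values):

```python
import pandas as pd

# Toy stand-in for `region_lang`; values are invented.
region_lang = pd.DataFrame({
    "region": ["Toronto", "Toronto", "Montréal", "Montréal"],
    "most_at_home": [50, 10, 40, 5],
})

# One groupby with two aggregations, then flatten back to ordinary columns.
region_summary = (
    region_lang.groupby("region")["most_at_home"].agg(["min", "max"]).reset_index()
)
region_summary.columns = ["region", "min_most_at_home", "max_most_at_home"]
print(region_summary)
```
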
@@ -1818,47 +1725,6 @@ summary methods (*e.g.* `.min`, `.max`, `.sum` etc.) can be used for data frames
 pd.DataFrame(region_lang.iloc[:, 3:].max(axis=0)).T
 ```
 
-```{code-cell} ipython3
----
-jupyter:
-  source_hidden: true
-tags: [remove-cell]
----
-# To summarize statistics across many columns, we can use the
-# `summarize` function we have just recently learned about.
-# However, in such a case, using `summarize` alone means that we have to
-# type out the name of each column we want to summarize.
-# To do this more efficiently, we can pair `summarize` with `across` \index{across}
-# and use a colon `:` to specify a range of columns we would like \index{column range}
-# to perform the statistical summaries on.
-# Here we demonstrate finding the maximum value
-# of each of the numeric
-# columns of the `region_lang` data set.
-
-# ``` {r 02-across-data}
-# region_lang |>
-#     summarize(across(mother_tongue:lang_known, max))
-# ```
-
-# > **Note:** Similar to when we use base R statistical summary functions
-# > (e.g., `max`, `min`, `mean`, `sum`, etc) with `summarize` alone,
-# > the use of the `summarize` + `across` functions paired
-# > with base R statistical summary functions
-# > also return `NA`s when we apply them to columns that
-# > contain `NA`s in the data frame. \index{missing data}
-# >
-# > To avoid this, again we need to add the argument `na.rm = TRUE`,
-# > but in this case we need to use it a little bit differently.
-# > In this case, we need to add a `,` and then `na.rm = TRUE`,
-# > after specifying the function we want `summarize` + `across` to apply,
-# > as illustrated below:
-# >
-# > ``` {r}
-# > region_lang_na |>
-# >     summarize(across(mother_tongue:lang_known, max, na.rm = TRUE))
-# > ```
-```
-
 (apply-summary)=
 #### `.apply` for calculating summary statistics on many columns
 
@@ -1875,7 +1741,7 @@ We focus on the two arguments of `.apply`:
 the function that you would like to apply to each column, and the `axis` along which the function will be applied (`0` for columns, `1` for rows).
 Note that `.apply` does not have an argument
 to specify *which* columns to apply the function to.
-Therefore, we will use the `.iloc[]` before calling `.apply`
+Therefore, we will use `[]` before calling `.apply`
 to choose the columns for which we want the maximum.
 
 ```{code-cell} ipython3
@@ -1898,7 +1764,7 @@ tags: [remove-cell]
 ```
 
 ```{code-cell} ipython3
-pd.DataFrame(region_lang.iloc[:, 3:].apply(max, axis=0)).T
+pd.DataFrame(region_lang[["most_at_home", "most_at_work"]].apply(max, axis=0)).T
 ```
 
 ```{index} missing data
@@ -1917,7 +1783,7 @@ pd.DataFrame(region_lang.iloc[:, 3:].apply(max, axis=0)).T
 
 ```{code-cell} ipython3
 pd.DataFrame(
-    region_lang_na.iloc[:, 3:].apply(lambda col: col.max(skipna=True), axis=0)
+    region_lang_na[["most_at_home", "most_at_work"]].apply(lambda col: col.max(skipna=True), axis=0)
 ).T
 ```
 
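The `skipna` pattern above, sketched on toy data with missing values (column names follow the diff; values are invented):

```python
import numpy as np
import pandas as pd

# Toy stand-in for `region_lang_na`, with NaNs sprinkled in.
region_lang_na = pd.DataFrame({
    "most_at_home": [50.0, np.nan, 10.0],
    "most_at_work": [30.0, 40.0, np.nan],
})

# Column-wise max that ignores the NaNs in each column,
# transposed into a one-row data frame.
maxes = pd.DataFrame(
    region_lang_na[["most_at_home", "most_at_work"]].apply(
        lambda col: col.max(skipna=True), axis=0
    )
).T
print(maxes)
```
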
@@ -2048,7 +1914,7 @@ To accomplish such a task, we can use `.apply`.
 This works in a similar way for column selection,
 as we saw when we used in Section {ref}`apply-summary` earlier.
 As we did above,
-we again use `.iloc` to specify the columns
+we again use `[]` to specify the columns
 as well as the `.apply` to specify the function we want to apply on these columns.
 However, a key difference here is that we are not using aggregating function here,
 which means that we get back a data frame with the same number of rows.
@@ -2074,8 +1940,8 @@ region_lang.info()
 ```
 
 ```{code-cell} ipython3
-region_lang_int32 = region_lang.iloc[:, 3:].apply(lambda col: col.astype('int32'), axis=0)
-region_lang_int32 = pd.concat((region_lang.iloc[:, :3], region_lang_int32), axis=1)
+region_lang_int32 = region_lang[["most_at_home", "most_at_work"]].apply(lambda col: col.astype('int32'), axis=0)
+region_lang_int32 = pd.concat((region_lang.drop(columns=["most_at_home", "most_at_work"]), region_lang_int32), axis=1)
 region_lang_int32
 ```
 
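The type-conversion pattern above, runnable on a toy frame (the real `region_lang` columns are assumed; values invented). `drop(columns=...)` keeps the untouched columns so the converted counts replace, rather than duplicate, the originals:

```python
import pandas as pd

# Toy stand-in for `region_lang`; values are invented.
region_lang = pd.DataFrame({
    "language": ["English", "French"],
    "most_at_home": [3000000, 2669195],
    "most_at_work": [1500000, 900000],
})

counts = ["most_at_home", "most_at_work"]

# Convert only the count columns to int32.
converted = region_lang[counts].apply(lambda col: col.astype("int32"), axis=0)

# Re-attach the non-count columns alongside the converted counts.
region_lang_int32 = pd.concat((region_lang.drop(columns=counts), converted), axis=1)
print(region_lang_int32.dtypes)
```
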
@@ -2111,7 +1977,7 @@ For instance, suppose we want to know the maximum value between `mother_tongue`,
 and `lang_known` for each language and region
 in the `region_lang` data set.
 In other words, we want to apply the `max` function *row-wise.*
-Before we use `.apply`, we will again use `.iloc` to select only the count columns
+Before we use `.apply`, we will again use `[]` to select only the count columns
 so we can see all the columns in the data frame's output easily in the book.
 So for this demonstration, the data set we are operating on looks like this:
 
@@ -2135,7 +2001,7 @@ tags: [remove-cell]
 ```
 
 ```{code-cell} ipython3
-region_lang.iloc[:, 3:]
+region_lang[["most_at_home", "most_at_work"]]
 ```
 
 Now we use `.apply` with argument `axis=1`, to tell Python that we would like
@@ -2157,7 +2023,7 @@ tags: [remove-cell]
 
 ```{code-cell} ipython3
 region_lang_rowwise = region_lang.assign(
-    maximum=region_lang.iloc[:, 3:].apply(max, axis=1)
+    maximum=region_lang[["most_at_home", "most_at_work"]].apply(max, axis=1)
 )
 
 region_lang_rowwise
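
The row-wise `.apply` above, runnable on a toy frame (column names follow the diff; values invented):

```python
import pandas as pd

# Toy stand-in for `region_lang`; values are invented.
region_lang = pd.DataFrame({
    "language": ["English", "French"],
    "most_at_home": [50, 5],
    "most_at_work": [30, 40],
})

# axis=1 hands each row of the selected count columns to `max`,
# producing one maximum per row.
region_lang_rowwise = region_lang.assign(
    maximum=region_lang[["most_at_home", "most_at_work"]].apply(max, axis=1)
)
print(region_lang_rowwise["maximum"].tolist())  # [50, 40]
```
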
