Commit 28608ab

tc polish on wrangling (#186)
1 parent 87b7584 commit 28608ab

1 file changed

+75
-57
lines changed

source/wrangling.md

@@ -841,17 +841,28 @@ indicating they are integer data types (i.e., numbers)!
841841

842842
Now that the `tidy_lang` data is indeed *tidy*, we can start manipulating it
843843
using the powerful suite of functions from the `pandas` package.
844-
We revisit the `[]` from the chapter on {ref}`intro`,
845-
which lets us create a subset of rows from a data frame.
846-
Recall the argument to `[]`:
847-
a list of column names, or a logical statement that evaluates to either `True` or `False`,
848-
where `[]` returns the rows where the logical statement evaluates to `True`.
849-
This section will highlight more advanced usage of the `[]` function.
850-
In particular, this section provides an in-depth treatment of the variety of logical statements
844+
We will first revisit the `[]` from the chapter on {ref}`intro`,
845+
which lets us obtain a subset of either the rows **or** the columns of a data frame.
846+
This section will highlight more advanced usage of the `[]` function,
847+
including an in-depth treatment of the variety of logical statements
851848
one can use in the `[]` to select subsets of rows.
852849

853850
+++
854851

852+
### Extracting columns by name
853+
854+
Recall that if we provide a list of column names, `[]` returns the subset of columns with those names.
855+
Suppose we wanted to select the columns `language`, `region`,
856+
`most_at_home` and `most_at_work` from the `tidy_lang` data set. Using what we
857+
learned in the chapter on {ref}`intro`, we can pass all of these column
858+
names into the square brackets.
859+
860+
```{code-cell} ipython3
861+
:tags: ["output_scroll"]
862+
tidy_lang[["language", "region", "most_at_home", "most_at_work"]]
863+
```
864+
865+
855866
### Extracting rows that have a certain value with `==`
856867
Suppose we are only interested in the subset of rows in `tidy_lang` corresponding to the
857868
official languages of Canada (English and French).
@@ -1022,7 +1033,10 @@ to make long chains of filtering operations a bit easier to read.
10221033
The `[]` operation is only used when you want to either filter rows **or** select columns;
10231034
it cannot be used to do both operations at the same time. This is where `loc[]`
10241035
comes in. For the first example, recall `loc[]` from Chapter {ref}`intro`,
1025-
which lets us create a subset of columns from a data frame.
1036+
which lets us create a subset of the rows and columns in the `tidy_lang` data frame.
1037+
In the first argument to `loc[]`, we specify a logical statement that
1038+
filters the rows to only those pertaining to the Toronto region,
1039+
and the second argument specifies a list of columns to keep by name.
10261040

10271041
```{code-cell} ipython3
10281042
:tags: ["output_scroll"]
@@ -1032,53 +1046,61 @@ tidy_lang.loc[
10321046
]
10331047
```
10341048

1035-
### Using `loc[]` to select ranges of columns
1036-
1037-
Suppose we wanted to select only the columns `language`, `region`,
1038-
`most_at_home` and `most_at_work` from the `tidy_lang` data set. Using what we
1039-
learned in the chapter on {ref}`intro`, we would pass all of these column names into the square brackets.
1049+
In addition to simultaneous subsetting of rows and columns, `loc[]` has two
1050+
more special capabilities beyond those of `[]`. First, `loc[]` has the ability to specify *ranges* of rows and columns.
1051+
For example, note that the list of columns `language`, `region`, `most_at_home`, `most_at_work`
1052+
corresponds to the *range* of columns from `language` to `most_at_work`.
1053+
Rather than explicitly listing all of the column names as we did above,
1054+
we can ask for the range of columns `"language":"most_at_work"`; the `:`-syntax
1055+
denotes a range, and is supported by the `loc[]` function, but not by `[]`.
10401056

10411057
```{code-cell} ipython3
10421058
:tags: ["output_scroll"]
1043-
tidy_lang[["language", "region", "most_at_home", "most_at_work"]]
1059+
tidy_lang.loc[
1060+
tidy_lang['region'] == 'Toronto',
1061+
"language":"most_at_work"
1062+
]
10441063
```
10451064

1046-
Note that we could obtain the same result by stating that we would like all of the columns
1047-
from `language` through `most_at_work`. Instead of passing a list of all of the column
1048-
names that we want, we can ask for the range of columns `"language":"most_at_work"`, which
1049-
you can read as "The columns from `language` to `most_at_work`".
1050-
This `:`-syntax is supported by the `loc` function,
1051-
but not by the `[]`, so we need to switch to using `loc[]` here.
1065+
We can pass `:` by itself—without anything before or after—to denote that we want to retrieve
1066+
everything. For example, to obtain a subset of all rows and only those columns ranging from `language` to `most_at_work`,
1067+
we could use the following expression.
10521068

10531069
```{code-cell} ipython3
10541070
:tags: ["output_scroll"]
10551071
tidy_lang.loc[:, "language":"most_at_work"]
10561072
```
10571073

1058-
We pass `:` before the comma indicating we want to retrieve all rows,
1059-
i.e. we are not filtering any rows in this expression.
1060-
Similarly, you can ask for all of the columns including and after `language` by doing the following
1074+
We can also omit the beginning or end of the `:` range expression to denote
1075+
that we want "everything up to" or "everything after" an element. For example,
1076+
if we want all of the columns including and after `language`, we can write the expression:
10611077

10621078
```{code-cell} ipython3
10631079
:tags: ["output_scroll"]
10641080
tidy_lang.loc[:, "language":]
10651081
```
1066-
10671082
By not putting anything after the `:`, Python reads this as "from `language` until the last column".
1083+
Similarly, we can specify that we want everything up to and including `language` by writing
1084+
the expression:
1085+
1086+
```{code-cell} ipython3
1087+
:tags: ["output_scroll"]
1088+
tidy_lang.loc[:, :"language"]
1089+
```
1090+
1091+
By not putting anything before the `:`, Python reads this as "from the first column until `language`."
10681092
Although the notation for selecting a range using `:` is convenient because less code is required,
10691093
it must be used carefully. If you were to re-order columns or add a column to the data frame, the
1070-
output would change. Using a list is more explicit and less prone to potential confusion.
1094+
output would change. Using a list is more explicit and less prone to potential confusion, but sometimes
1095+
involves a lot more typing.
10711096

1072-
Suppose instead we wanted to extract columns that followed a particular pattern
1073-
rather than just selecting a range. For example, let's say we wanted only to select the
1074-
columns `most_at_home` and `most_at_work`. There are other functions that allow
1075-
us to select variables based on their names. In particular, we can use the `.str.startswith` method
1097+
The second special capability of `.loc[]` over `[]` is that it enables *selecting columns* using
1098+
logical statements. The `[]` operator can only use logical statements to filter rows; `.loc[]` can do both!
1099+
For example, let's say we wanted only to select the
1100+
columns `most_at_home` and `most_at_work`. We could then use the `.str.startswith` method
10761101
to choose only the columns that start with the word "most".
1077-
The `str.startswith` expression returns a boolean list
1078-
corresponding to the column names
1079-
which means that we have to use `.loc[]`
1080-
since passing this list to `[]`
1081-
would attempt to filter the rows instead of the columns.
1102+
The `str.startswith` expression returns a list of `True` or `False` values
1103+
corresponding to the column names that start with the desired characters.
10821104

10831105
```{code-cell} ipython3
10841106
tidy_lang.loc[:, tidy_lang.columns.str.startswith('most')]
@@ -1110,32 +1132,26 @@ has index `1`!).
11101132
tidy_lang.iloc[:, 1]
11111133
```
11121134

1113-
You can also ask for multiple columns,
1114-
we pass `1:` after the comma
1135+
You can also ask for multiple columns.
1136+
We pass `1:` after the comma
11151137
indicating we want columns after and including index 1 (*i.e.* `language`).
11161138

11171139
```{code-cell} ipython3
11181140
tidy_lang.iloc[:, 1:]
11191141
```
11201142

1121-
We can also use `iloc[]` to select ranges of rows, using a similar syntax.
1122-
For example to select the ten first rows we could use the following:
1123-
1124-
```{code-cell} ipython3
1125-
tidy_lang.iloc[:10, :]
1126-
```
1127-
1128-
`pandas` also provides a shorthand for selecting ranges of rows by using `[]`:
1143+
We can also use `iloc[]` to select ranges of rows, or simultaneously select ranges of rows and columns, using a similar syntax.
1144+
For example, to select the first five rows and columns after and including index 1, we could use the following:
11291145

11301146
```{code-cell} ipython3
1131-
tidy_lang[:10]
1147+
tidy_lang.iloc[:5, 1:]
11321148
```
11331149

1134-
The `iloc[]` method is less commonly used, and needs to be used with care.
1150+
Note that the `iloc[]` method is not commonly used, and must be used with care.
11351151
For example, it is easy to
11361152
accidentally put in the wrong integer index! If you did not correctly remember
11371153
that the `language` column was index `1`, and used `2` instead, your code
1138-
would end up having a bug that might be quite hard to track down.
1154+
might end up having a bug that is quite hard to track down.
11391155

11401156
```{index} pandas.Series; str.startswith
11411157
```
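The `loc[]` and `iloc[]` hunks above describe three ways of selecting columns: by label range, by a logical statement over the column names, and by integer position. A minimal sketch of how they differ follows. It assumes a small hypothetical data frame named `toy_lang` whose column names mirror `tidy_lang`; the values are made up for illustration and do not come from the real data set.

```python
import pandas as pd

# Hypothetical toy data: column names mirror tidy_lang, values are invented.
toy_lang = pd.DataFrame(
    {
        "category": ["Official", "Official", "Non-official"],
        "language": ["English", "French", "Mandarin"],
        "region": ["Toronto", "Toronto", "Vancouver"],
        "most_at_home": [10, 20, 30],
        "most_at_work": [1, 2, 3],
    }
)

# Label-based range with loc[]: both endpoints of the range are included.
by_label = toy_lang.loc[toy_lang["region"] == "Toronto", "language":"most_at_work"]

# A logical statement over the column names selects columns, not rows.
most_cols = toy_lang.loc[:, toy_lang.columns.str.startswith("most")]

# Position-based range with iloc[]: the end position is excluded.
by_position = toy_lang.iloc[:2, 1:]

print(by_label.columns.tolist())
print(most_cols.columns.tolist())
print(by_position.shape)
```

Note the asymmetry: a label range in `loc[]` includes its end label, whereas a position range in `iloc[]` stops before its end position, which is one reason integer indexing is easy to get wrong.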
@@ -1292,12 +1308,12 @@ region_lang.mean(numeric_only=True)
12921308
```
12931309

12941310
If there are only some columns for which you would like to get summary statistics,
1295-
you can first use `[]` to select those columns
1296-
and then ask for the summary statistic,
1297-
as we did for a single column previously:
1298-
Lets say that we want to know
1299-
the mean and standard deviation of all of the columns between `"mother_tongue"` and `"lang_known"`.
1300-
We use `[]` to specify the columns and then `agg` to ask for both the `mean` and `std`.
1311+
you can first use `[]` or `.loc[]` to select those columns,
1312+
and then ask for the summary statistic
1313+
as we did for a single column previously.
1314+
For example, if we want to know
1315+
the mean and standard deviation of all of the columns between `"mother_tongue"` and `"lang_known"`,
1316+
we use `.loc[]` to select those columns and then `agg` to ask for both the `mean` and `std`.
13011317
```{code-cell} ipython3
13021318
region_lang.loc[:, "mother_tongue":"lang_known"].agg(["mean", "std"])
13031319
```
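As a rough illustration of the pattern in the hunk above (select a range of columns with `.loc[]`, then summarize with `agg`), here is a self-contained sketch. The data frame `toy_region_lang` and its values are hypothetical stand-ins for `region_lang`, not the real data.

```python
import pandas as pd

# Hypothetical stand-in for region_lang with invented counts.
toy_region_lang = pd.DataFrame(
    {
        "region": ["A", "A", "B", "B"],
        "mother_tongue": [10, 20, 30, 40],
        "most_at_home": [1, 3, 5, 7],
        "lang_known": [100, 200, 300, 400],
    }
)

# Select every column from mother_tongue through lang_known (inclusive),
# then compute the mean and standard deviation of each selected column.
summary = toy_region_lang.loc[:, "mother_tongue":"lang_known"].agg(["mean", "std"])

print(summary)
```

The result has one row per statistic (`mean`, `std`) and one column per selected variable; the non-numeric `region` column is left out because it falls outside the selected range.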
@@ -1344,15 +1360,17 @@ region_lang.groupby("region")
13441360

13451361
Notice that `groupby` converts a `DataFrame` object to a `DataFrameGroupBy`
13461362
object, which contains information about the groups of the data frame. We can
1347-
then apply aggregating functions to the `DataFrameGroupBy` object. This can be handy if you would like to perform multiple operations and assign
1348-
each output to its own object.
1363+
then apply aggregating functions to the `DataFrameGroupBy` object. Here we first
1364+
select the `most_at_home` column, and then summarize the grouped data by their
1365+
minimum and maximum values using `agg`.
13491366

13501367
```{code-cell} ipython3
13511368
region_lang.groupby("region")["most_at_home"].agg(["min", "max"])
13521369
```
13531370

13541371
The resulting dataframe has `region` as an index name.
1355-
This is similar to what happened when we reshaped data frames in the previous chapter,
1372+
This is similar to what happened when we used the `pivot` function
1373+
in the section on {ref}`pivot-wider`;
13561374
and just as we did then,
13571375
you can use `reset_index` to get back to a regular dataframe
13581376
with `region` as a column name.
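To make the index behaviour described above concrete, the sketch below groups a small hypothetical data frame (`toy_region_lang`, with invented values) by `region`, summarizes one column, and then calls `reset_index` to turn `region` back into an ordinary column.

```python
import pandas as pd

# Hypothetical stand-in for region_lang with invented counts.
toy_region_lang = pd.DataFrame(
    {
        "region": ["A", "A", "B", "B"],
        "most_at_home": [1, 3, 5, 7],
    }
)

# After groupby + agg, the grouping column becomes the index of the result.
by_region = toy_region_lang.groupby("region")["most_at_home"].agg(["min", "max"])

# reset_index moves region out of the index and back into a regular column.
flat = by_region.reset_index()

print(by_region.index.name)
print(flat.columns.tolist())
```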
@@ -1369,7 +1387,7 @@ list including `region` and `category` to `groupby`.
13691387
region_lang.groupby(["region", "category"])["most_at_home"].agg(["min", "max"])
13701388
```
13711389

1372-
You can also ask for grouped summary statistics on the whole data frame
1390+
You can also ask for grouped summary statistics on the whole data frame.
13731391

13741392
```{code-cell} ipython3
13751393
:tags: ["output_scroll"]