@@ -841,17 +841,28 @@ indicating they are integer data types (i.e., numbers)!
841
841
842
842
Now that the ` tidy_lang ` data is indeed * tidy* , we can start manipulating it
843
843
using the powerful suite of functions from the ` pandas ` package.
844
- We revisit the ` [] ` from the chapter on {ref}` intro ` ,
845
- which lets us create a subset of rows from a data frame.
846
- Recall the argument to ` [] ` :
847
- a list of column names, or a logical statement that evaluates to either ` True ` or ` False ` ,
848
- where ` [] ` returns the rows where the logical statement evaluates to ` True ` .
849
- This section will highlight more advanced usage of the ` [] ` function.
850
- In particular, this section provides an in-depth treatment of the variety of logical statements
844
+ We will first revisit the ` [] ` from the chapter on {ref}` intro ` ,
845
+ which lets us obtain a subset of either the rows ** or** the columns of a data frame.
846
+ This section will highlight more advanced usage of the ` [] ` function,
847
+ including an in-depth treatment of the variety of logical statements
851
848
one can use in the ` [] ` to select subsets of rows.
852
849
853
850
+++
854
851
852
+ ### Extracting columns by name
853
+
854
+ Recall that if we provide a list of column names, ` [] ` returns the subset of columns with those names.
855
+ Suppose we wanted to select the columns ` language ` , ` region ` ,
856
+ ` most_at_home ` and ` most_at_work ` from the ` tidy_lang ` data set. Using what we
857
+ learned in the chapter on {ref}` intro ` , we can pass all of these column
858
+ names into the square brackets.
859
+
860
+ ``` {code-cell} ipython3
861
+ :tags: ["output_scroll"]
862
+ tidy_lang[["language", "region", "most_at_home", "most_at_work"]]
863
+ ```
864
+
865
+
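As a minimal, self-contained sketch of this kind of column selection, the example below builds a small stand-in data frame. The name `toy_lang` and all of its values are hypothetical and only mirror the column names of `tidy_lang`.

```python
import pandas as pd

# Hypothetical toy data frame; column names mirror tidy_lang, values are made up.
toy_lang = pd.DataFrame(
    {
        "category": ["Official languages", "Official languages", "Non-official languages"],
        "language": ["English", "French", "Spanish"],
        "region": ["Toronto", "Toronto", "Toronto"],
        "most_at_home": [100, 20, 5],
        "most_at_work": [90, 10, 1],
    }
)

# Passing a list of names to [] keeps only those columns, in the order listed.
selected = toy_lang[["language", "region", "most_at_home", "most_at_work"]]
print(selected.columns.tolist())
```

Note the double square brackets: the outer pair is the `[]` operation itself, and the inner pair creates the Python list of column names.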
855
866
### Extracting rows that have a certain value with ` == `
856
867
Suppose we are only interested in the subset of rows in ` tidy_lang ` corresponding to the
857
868
official languages of Canada (English and French).
@@ -1022,7 +1033,10 @@ to make long chains of filtering operations a bit easier to read.
1022
1033
The ` [] ` operation is only used when you want to either filter rows ** or** select columns;
1023
1034
it cannot be used to do both operations at the same time. This is where ` loc[] `
1024
1035
comes in. For the first example, recall ` loc[] ` from Chapter {ref}` intro ` ,
1025
- which lets us create a subset of columns from a data frame.
1036
+ which lets us create a subset of the rows and columns in the ` tidy_lang ` data frame.
1037
+ In the first argument to ` loc[] ` , we specify a logical statement that
1038
+ filters the rows to only those pertaining to the Toronto region,
1039
+ and the second argument specifies a list of columns to keep by name.
1026
1040
1027
1041
``` {code-cell} ipython3
1028
1042
:tags: ["output_scroll"]
@@ -1032,53 +1046,61 @@ tidy_lang.loc[
1032
1046
]
1033
1047
```
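The same pattern can be tried on a small stand-in data frame. This is only a sketch: `toy_lang` and its values are hypothetical, and the choice of columns to keep is an assumption made for illustration.

```python
import pandas as pd

# Hypothetical toy data frame; values are made up for illustration.
toy_lang = pd.DataFrame(
    {
        "language": ["English", "French", "English", "French"],
        "region": ["Toronto", "Toronto", "Montréal", "Montréal"],
        "most_at_home": [100, 20, 30, 80],
        "most_at_work": [90, 10, 25, 70],
    }
)

# First argument: a logical statement that is True for the rows to keep.
is_toronto = toy_lang["region"] == "Toronto"

# Second argument: a list of the columns to keep, by name.
toronto = toy_lang.loc[is_toronto, ["language", "most_at_home"]]
print(toronto)
```

Storing the logical statement in its own variable (here `is_toronto`) is optional, but it can make a longer `loc[]` expression easier to read.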
1034
1048
1035
- ### Using ` loc[] ` to select ranges of columns
1036
-
1037
- Suppose we wanted to select only the columns ` language ` , ` region ` ,
1038
- ` most_at_home ` and ` most_at_work ` from the ` tidy_lang ` data set. Using what we
1039
- learned in the chapter on {ref}` intro ` , we would pass all of these column names into the square brackets.
1049
+ In addition to simultaneous subsetting of rows and columns, ` loc[] ` has two
1050
+ more special capabilities beyond those of ` [] ` . First, ` loc[] ` has the ability to specify * ranges* of rows and columns.
1051
+ For example, note that the list of columns ` language ` , ` region ` , ` most_at_home ` , ` most_at_work `
1052
+ corresponds to the * range* of columns from ` language ` to ` most_at_work ` .
1053
+ Rather than explicitly listing all of the column names as we did above,
1054
+ we can ask for the range of columns ` "language":"most_at_work" ` ; the ` : ` -syntax
1055
+ denotes a range, and is supported by the ` loc[] ` function, but not by ` [] ` .
1040
1056
1041
1057
``` {code-cell} ipython3
1042
1058
:tags: ["output_scroll"]
1043
- tidy_lang[["language", "region", "most_at_home", "most_at_work"]]
1059
+ tidy_lang.loc[
1060
+ tidy_lang['region'] == 'Toronto',
1061
+ "language":"most_at_work"
1062
+ ]
1044
1063
```
1045
1064
1046
- Note that we could obtain the same result by stating that we would like all of the columns
1047
- from ` language ` through ` most_at_work ` . Instead of passing a list of all of the column
1048
- names that we want, we can ask for the range of columns ` "language":"most_at_work" ` , which
1049
- you can read as "The columns from ` language ` to ` most_at_work ` ".
1050
- This ` : ` -syntax is supported by the ` loc ` function,
1051
- but not by the ` [] ` , so we need to switch to using ` loc[] ` here.
1065
+ We can pass ` : ` by itself, without anything before or after, to denote that we want to retrieve
1066
+ everything. For example, to obtain a subset of all rows and only those columns ranging from ` language ` to ` most_at_work ` ,
1067
+ we could use the following expression.
1052
1068
1053
1069
``` {code-cell} ipython3
1054
1070
:tags: ["output_scroll"]
1055
1071
tidy_lang.loc[:, "language":"most_at_work"]
1056
1072
```
1057
1073
1058
- We pass ` : ` before the comma indicating we want to retrieve all rows,
1059
- i.e. we are not filtering any rows in this expression.
1060
- Similarly, you can ask for all of the columns including and after ` language ` by doing the following
1074
+ We can also omit the beginning or end of the ` : ` range expression to denote
1075
+ that we want "everything up to" or "everything after" an element. For example,
1076
+ if we want all of the columns including and after ` language ` , we can write the expression:
1061
1077
1062
1078
``` {code-cell} ipython3
1063
1079
:tags: ["output_scroll"]
1064
1080
tidy_lang.loc[:, "language":]
1065
1081
```
1066
-
1067
1082
By not putting anything after the ` : ` , Python reads this as "from ` language ` until the last column".
1083
+ Similarly, we can specify that we want everything up to and including ` language ` by writing
1084
+ the expression:
1085
+
1086
+ ``` {code-cell} ipython3
1087
+ :tags: ["output_scroll"]
1088
+ tidy_lang.loc[:, :"language"]
1089
+ ```
1090
+
1091
+ By not putting anything before the ` : ` , Python reads this as "from the first column until ` language ` ."
1068
1092
Although the notation for selecting a range using ` : ` is convenient because less code is required,
1069
1093
it must be used carefully. If you were to re-order columns or add a column to the data frame, the
1070
- output would change. Using a list is more explicit and less prone to potential confusion.
1094
+ output would change. Using a list is more explicit and less prone to potential confusion, but sometimes
1095
+ involves a lot more typing.
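To make this pitfall concrete, here is a small sketch using a hypothetical data frame (`toy_lang`, with made-up values). It shows that the same range expression returns different columns once the column order changes, while a list of names does not.

```python
import pandas as pd

# Hypothetical toy data frame; values are made up for illustration.
toy_lang = pd.DataFrame(
    {
        "category": ["Official languages", "Official languages"],
        "language": ["English", "French"],
        "region": ["Toronto", "Toronto"],
        "most_at_home": [100, 20],
        "most_at_work": [90, 10],
    }
)

# With the original column order, the range covers four columns
# (both endpoints are included when slicing by label with loc[]).
by_range = toy_lang.loc[:, "language":"most_at_work"]

# Re-order the columns so that most_at_work comes right after language.
reordered = toy_lang[["category", "language", "most_at_work", "region", "most_at_home"]]

# The same range expression now covers only two columns.
by_range_reordered = reordered.loc[:, "language":"most_at_work"]

# An explicit list returns the same four columns regardless of column order.
by_list = reordered[["language", "region", "most_at_home", "most_at_work"]]

print(by_range.columns.tolist())
print(by_range_reordered.columns.tolist())
print(by_list.columns.tolist())
```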
1071
1096
1072
- Suppose instead we wanted to extract columns that followed a particular pattern
1073
- rather than just selecting a range. For example, let's say we wanted only to select the
1074
- columns ` most_at_home ` and ` most_at_work ` . There are other functions that allow
1075
- us to select variables based on their names. In particular, we can use the ` .str.startswith ` method
1097
+ The second special capability of ` .loc[] ` over ` [] ` is that it enables * selecting columns * using
1098
+ logical statements. The ` [] ` operator can only use logical statements to filter rows; ` .loc[] ` can do both!
1099
+ For example, let's say we wanted only to select the
1100
+ columns ` most_at_home ` and ` most_at_work ` . We could then use the ` .str.startswith ` method
1076
1101
to choose only the columns that start with the word "most".
1077
- The ` str.startswith ` expression returns a boolean list
1078
- corresponding to the column names
1079
- which means that we have to use ` .loc[] `
1080
- since passing this list to ` [] `
1081
- would attempt to filter the rows instead of the columns.
1102
+ The ` str.startswith ` expression returns an array of ` True ` and ` False ` values,
1103
+ one per column name, indicating whether that name starts with the desired characters.
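To see what these `True` and `False` values look like, here is a small sketch on a hypothetical data frame (`toy_lang`, with made-up values) that builds the values first and then uses them to select columns.

```python
import pandas as pd

# Hypothetical toy data frame; values are made up for illustration.
toy_lang = pd.DataFrame(
    {
        "category": ["Official languages", "Official languages"],
        "language": ["English", "French"],
        "region": ["Toronto", "Toronto"],
        "most_at_home": [100, 20],
        "most_at_work": [90, 10],
    }
)

# One True/False value per column name, in column order.
starts_with_most = toy_lang.columns.str.startswith("most")
print(starts_with_most)

# In the column position of loc[], the True/False values pick which columns to keep.
most_columns = toy_lang.loc[:, starts_with_most]
print(most_columns.columns.tolist())
```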
1082
1104
1083
1105
``` {code-cell} ipython3
1084
1106
tidy_lang.loc[:, tidy_lang.columns.str.startswith('most')]
@@ -1110,32 +1132,26 @@ has index `1`!).
1110
1132
tidy_lang.iloc[:, 1]
1111
1133
```
1112
1134
1113
- You can also ask for multiple columns,
1114
- we pass ` 1: ` after the comma
1135
+ You can also ask for multiple columns.
1136
+ We pass ` 1: ` after the comma
1115
1137
indicating we want columns after and including index 1 (* i.e.* ` language ` ).
1116
1138
1117
1139
``` {code-cell} ipython3
1118
1140
tidy_lang.iloc[:, 1:]
1119
1141
```
1120
1142
1121
- We can also use ` iloc[] ` to select ranges of rows, using a similar syntax.
1122
- For example to select the ten first rows we could use the following:
1123
-
1124
- ``` {code-cell} ipython3
1125
- tidy_lang.iloc[:10, :]
1126
- ```
1127
-
1128
- ` pandas ` also provides a shorthand for selecting ranges of rows by using ` [] ` :
1143
+ We can also use ` iloc[] ` to select ranges of rows, or simultaneously select ranges of rows and columns, using a similar syntax.
1144
+ For example, to select the first five rows and the columns from index 1 onward, we could use the following:
1129
1145
1130
1146
``` {code-cell} ipython3
1131
- tidy_lang[:10 ]
1147
+ tidy_lang.iloc[:5, 1:]
1132
1148
```
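The sketch below tries position-based selection on a hypothetical data frame (`toy_lang`, with made-up values) in which `language` sits at index `1`. It also shows how an index that is off by one returns a different column without raising any error.

```python
import pandas as pd

# Hypothetical toy data frame; values are made up for illustration.
toy_lang = pd.DataFrame(
    {
        "category": ["Official languages", "Official languages", "Non-official languages"],
        "language": ["English", "French", "Spanish"],
        "region": ["Toronto", "Toronto", "Toronto"],
        "most_at_home": [100, 20, 5],
        "most_at_work": [90, 10, 1],
    }
)

# First two rows, and every column from position 1 onward.
first_two = toy_lang.iloc[:2, 1:]
print(first_two.columns.tolist())

# Position 1 is the language column here ...
column_1 = toy_lang.iloc[:, 1]

# ... while position 2 is region: an off-by-one index gives no error, just other data.
column_2 = toy_lang.iloc[:, 2]
print(column_1.name, column_2.name)
```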
1133
1149
1134
- The ` iloc[] ` method is less commonly used, and needs to be used with care.
1150
+ Note that the ` iloc[] ` method is not commonly used, and must be used with care.
1135
1151
For example, it is easy to
1136
1152
accidentally put in the wrong integer index! If you did not correctly remember
1137
1153
that the ` language ` column was index ` 1 ` , and used ` 2 ` instead, your code
1138
- would end up having a bug that might be quite hard to track down.
1154
+ might end up having a bug that is quite hard to track down.
1139
1155
1140
1156
``` {index} pandas.Series; str.startswith
1141
1157
```
@@ -1292,12 +1308,12 @@ region_lang.mean(numeric_only=True)
1292
1308
```
1293
1309
1294
1310
If there are only some columns for which you would like to get summary statistics,
1295
- you can first use ` [] ` to select those columns
1296
- and then ask for the summary statistic,
1297
- as we did for a single column previously:
1298
- Lets say that we want to know
1299
- the mean and standard deviation of all of the columns between ` "mother_tongue" ` and ` "lang_known" ` .
1300
- We use ` [] ` to specify the columns and then ` agg ` to ask for both the ` mean ` and ` std ` .
1311
+ you can first use ` [] ` or ` .loc[] ` to select those columns,
1312
+ and then ask for the summary statistic
1313
+ as we did for a single column previously.
1314
+ For example, if we want to know
1315
+ the mean and standard deviation of all of the columns between ` "mother_tongue" ` and ` "lang_known" ` ,
1316
+ we use ` .loc[] ` to select those columns and then ` agg ` to ask for both the ` mean ` and ` std ` .
1301
1317
``` {code-cell} ipython3
1302
1318
region_lang.loc[:, "mother_tongue":"lang_known"].agg(["mean", "std"])
1303
1319
```
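As a sketch of what this returns, the example below applies the same expression to a hypothetical data frame (`toy_region`, with made-up values chosen so that the results are easy to check by hand).

```python
import pandas as pd

# Hypothetical toy data frame; values are made up for illustration.
toy_region = pd.DataFrame(
    {
        "region": ["Toronto", "Montréal", "Vancouver"],
        "mother_tongue": [10, 20, 30],
        "most_at_home": [5, 10, 15],
        "most_at_work": [1, 2, 3],
        "lang_known": [100, 200, 300],
    }
)

# Select the range of columns, then compute two summary statistics for each.
summary = toy_region.loc[:, "mother_tongue":"lang_known"].agg(["mean", "std"])
print(summary)
```

The result has one row per statistic and one column per selected column, so a single value can be looked up with, for example, `summary.loc["mean", "mother_tongue"]`.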
@@ -1344,15 +1360,17 @@ region_lang.groupby("region")
1344
1360
1345
1361
Notice that ` groupby ` converts a ` DataFrame ` object to a ` DataFrameGroupBy `
1346
1362
object, which contains information about the groups of the data frame. We can
1347
- then apply aggregating functions to the ` DataFrameGroupBy ` object. This can be handy if you would like to perform multiple operations and assign
1348
- each output to its own object.
1363
+ then apply aggregating functions to the ` DataFrameGroupBy ` object. Here we first
1364
+ select the ` most_at_home ` column, and then summarize the grouped data by their
1365
+ minimum and maximum values using ` agg ` .
1349
1366
1350
1367
``` {code-cell} ipython3
1351
1368
region_lang.groupby("region")["most_at_home"].agg(["min", "max"])
1352
1369
```
1353
1370
1354
1371
The resulting dataframe has ` region ` as an index name.
1355
- This is similar to what happened when we reshaped data frames in the previous chapter,
1372
+ This is similar to what happened when we used the ` pivot ` function
1373
+ in the section on {ref}` pivot-wider ` ;
1356
1374
and just as we did then,
1357
1375
you can use ` reset_index ` to get back to a regular dataframe
1358
1376
with ` region ` as a column name.
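A minimal sketch of this round trip, using a hypothetical data frame (`toy_region`, with made-up values), might look as follows.

```python
import pandas as pd

# Hypothetical toy data frame; values are made up for illustration.
toy_region = pd.DataFrame(
    {
        "region": ["Toronto", "Toronto", "Montréal", "Montréal"],
        "most_at_home": [100, 20, 30, 80],
    }
)

# Grouped summary: the group labels end up in the index, which is named "region".
grouped = toy_region.groupby("region")["most_at_home"].agg(["min", "max"])
print(grouped.index.name)

# reset_index moves the group labels back into a regular column.
flat = grouped.reset_index()
print(flat.columns.tolist())
```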
@@ -1369,7 +1387,7 @@ list including `region` and `category` to `groupby`.
1369
1387
region_lang.groupby(["region", "category"])["most_at_home"].agg(["min", "max"])
1370
1388
```
1371
1389
1372
- You can also ask for grouped summary statistics on the whole data frame
1390
+ You can also ask for grouped summary statistics on the whole data frame.
1373
1391
1374
1392
``` {code-cell} ipython3
1375
1393
:tags: ["output_scroll"]