@@ -837,17 +837,28 @@ indicating they are integer data types (i.e., numbers)!
837
837
838
838
Now that the ` tidy_lang ` data is indeed * tidy* , we can start manipulating it
839
839
using the powerful suite of functions from the `pandas` package.
840
- We revisit the ` [] ` from the chapter on {ref}` intro ` ,
841
- which lets us create a subset of rows from a data frame.
842
- Recall the argument to ` [] ` :
843
- a list of column names, or a logical statement that evaluates to either ` True ` or ` False ` ,
844
- where ` [] ` returns the rows where the logical statement evaluates to ` True ` .
845
- This section will highlight more advanced usage of the ` [] ` function.
846
- In particular, this section provides an in-depth treatment of the variety of logical statements
840
+ We will first revisit the ` [] ` from the chapter on {ref}` intro ` ,
841
+ which lets us obtain a subset of either the rows **or** the columns of a data frame.
842
+ This section will highlight more advanced usage of the ` [] ` function,
843
+ including an in-depth treatment of the variety of logical statements
847
844
one can use in the ` [] ` to select subsets of rows.
848
845
849
846
+++
850
847
848
+ ### Extracting columns by name
849
+
850
+ Recall that if we provide a list of column names, ` [] ` returns the subset of columns with those names.
851
+ Suppose we wanted to select the columns ` language ` , ` region ` ,
852
+ ` most_at_home ` and ` most_at_work ` from the ` tidy_lang ` data set. Using what we
853
+ learned in the chapter on {ref}` intro ` , we can pass all of these column
854
+ names into the square brackets.
855
+
856
+ ``` {code-cell} ipython3
857
+ :tags: ["output_scroll"]
858
+ tidy_lang[["language", "region", "most_at_home", "most_at_work"]]
859
+ ```
860
+
861
+
851
862
### Extracting rows that have a certain value with ` == `
852
863
Suppose we are only interested in the subset of rows in ` tidy_lang ` corresponding to the
853
864
official languages of Canada (English and French).
@@ -1010,55 +1021,82 @@ is less often used than the earlier approaches we introduced, but it can come in
1010
1021
to make long chains of filtering operations a bit easier to read.
1011
1022
1012
1023
(loc-iloc)=
1013
- ## Using ` loc[] ` to filter rows and select columns.
1024
+ ## Using ` loc[] ` to filter rows and select columns
1025
+
1014
1026
``` {index} pandas.DataFrame; loc[]
1015
1027
```
1016
1028
1017
- The ` [] ` operation is only used when you want to filter rows or select columns;
1029
+ The `[]` operation is only used when you want to either filter rows **or** select columns;
1018
1030
it cannot be used to do both operations at the same time. This is where ` loc[] `
1019
1031
comes in. For the first example, recall ` loc[] ` from Chapter {ref}` intro ` ,
1020
- which lets us create a subset of columns from a data frame.
1021
- Suppose we wanted to select only the columns ` language ` , ` region ` ,
1022
- ` most_at_home ` and ` most_at_work ` from the ` tidy_lang ` data set. Using what we
1023
- learned in the chapter on {ref} ` intro ` , we would pass all of these column names into the square brackets .
1032
+ which lets us create a subset of the rows and columns in the ` tidy_lang ` data frame.
1033
+ In the first argument to ` loc[] ` , we specify a logical statement that
1034
+ filters the rows to only those pertaining to the Toronto region,
1035
+ and the second argument specifies a list of columns to keep by name .
1024
1036
1025
1037
``` {code-cell} ipython3
1026
1038
:tags: ["output_scroll"]
1027
- selected_columns = tidy_lang.loc[:, ["language", "region", "most_at_home", "most_at_work"]]
1028
- selected_columns
1039
+ tidy_lang.loc[
1040
+ tidy_lang['region'] == 'Toronto',
1041
+ ["language", "region", "most_at_home", "most_at_work"]
1042
+ ]
1029
1043
```
1030
- We pass ` : ` before the comma indicating we want to retrieve all rows, and the list indicates
1031
- the columns that we want.
1032
1044
1033
- Note that we could obtain the same result by stating that we would like all of the columns
1034
- from ` language ` through ` most_at_work ` . Instead of passing a list of all of the column
1035
- names that we want, we can ask for the range of columns ` "language":"most_at_work" ` , which
1036
- you can read as "The columns from ` language ` to ` most_at_work ` ".
1045
+ In addition to simultaneous subsetting of rows and columns, ` loc[] ` has two
1046
+ more special capabilities beyond those of ` [] ` . First, ` loc[] ` has the ability to specify * ranges* of rows and columns.
1047
+ For example, note that the list of columns ` language ` , ` region ` , ` most_at_home ` , ` most_at_work `
1048
+ corresponds to the * range* of columns from ` language ` to ` most_at_work ` .
1049
+ Rather than explicitly listing all of the column names as we did above,
1050
+ we can ask for the range of columns ` "language":"most_at_work" ` ; the ` : ` -syntax
1051
+ denotes a range, and is supported by the ` loc[] ` function, but not by ` [] ` .
1037
1052
1038
1053
``` {code-cell} ipython3
1039
1054
:tags: ["output_scroll"]
1040
- selected_columns = tidy_lang.loc[:, "language":"most_at_work"]
1041
- selected_columns
1055
+ tidy_lang.loc[
1056
+ tidy_lang['region'] == 'Toronto',
1057
+ "language":"most_at_work"
1058
+ ]
1042
1059
```
1043
1060
1044
- Similarly, you can ask for all of the columns including and after ` language ` by doing the following
1061
+ We can pass `:` by itself, without anything before or after, to denote that we want to retrieve
1062
+ everything. For example, to obtain a subset of all rows and only those columns ranging from ` language ` to ` most_at_work ` ,
1063
+ we could use the following expression.
1045
1064
1046
1065
``` {code-cell} ipython3
1047
1066
:tags: ["output_scroll"]
1048
- selected_columns = tidy_lang.loc[:, "language":]
1049
- selected_columns
1067
+ tidy_lang.loc[:, "language":"most_at_work"]
1050
1068
```
1051
1069
1052
- By not putting anything after the ` : ` , python reads this as "from ` language ` until the last column".
1053
- Although the notation for selecting a range using ` : ` is convienent because less code is required,
1070
+ We can also omit the beginning or end of the ` : ` range expression to denote
1071
+ that we want "everything up to" or "everything after" an element. For example,
1072
+ if we want all of the columns including and after ` language ` , we can write the expression:
1073
+
1074
+ ``` {code-cell} ipython3
1075
+ :tags: ["output_scroll"]
1076
+ tidy_lang.loc[:, "language":]
1077
+ ```
1078
+ By not putting anything after the ` : ` , Python reads this as "from ` language ` until the last column".
1079
+ Similarly, we can specify that we want everything up to and including ` language ` by writing
1080
+ the expression:
1081
+
1082
+ ``` {code-cell} ipython3
1083
+ :tags: ["output_scroll"]
1084
+ tidy_lang.loc[:, :"language"]
1085
+ ```
1086
+
1087
+ By not putting anything before the ` : ` , Python reads this as "from the first column until ` language ` ."
1088
+ Although the notation for selecting a range using ` : ` is convenient because less code is required,
1054
1089
it must be used carefully. If you were to re-order columns or add a column to the data frame, the
1055
- output would change. Using a list is more explicit and less prone to potential confusion.
1090
+ output would change. Using a list is more explicit and less prone to potential confusion, but sometimes
1091
+ involves a lot more typing.
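To make the caution about ranges concrete, here is a minimal sketch on a small made-up data frame. The column names mirror `tidy_lang`, but the values and the variable names (`toy`, `reordered`, and so on) are hypothetical and only for illustration.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "category": ["Official", "Official"],
    "language": ["English", "French"],
    "region": ["Toronto", "Toronto"],
    "most_at_home": [10, 2],
    "most_at_work": [12, 1],
})

# Label ranges in `loc[]` include both endpoints.
by_range = toy.loc[:, "language":"most_at_home"]

# Moving `region` to the end changes what the same range returns.
reordered = toy[["category", "language", "most_at_home", "most_at_work", "region"]]
by_range_reordered = reordered.loc[:, "language":"most_at_home"]

# An explicit list returns the same columns regardless of their order in the frame.
by_list = reordered[["language", "region", "most_at_home"]]

print(list(by_range.columns))
print(list(by_range_reordered.columns))
print(list(by_list.columns))
```

The range expression runs without error in both cases, which is why a change in column order can go unnoticed.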
1056
1092
1057
- Suppose instead we wanted to extract columns that followed a particular pattern
1058
- rather than just selecting a range. For example, let's say we wanted only to select the
1059
- columns ` most_at_home ` and ` most_at_work ` . There are other functions that allow
1060
- us to select variables based on their names. In particular, we can use the ` .str.startswith ` method
1061
- to choose only the columns that start with the word "most":
1093
+ The second special capability of ` .loc[] ` over ` [] ` is that it enables * selecting columns* using
1094
+ logical statements. The ` [] ` operator can only use logical statements to filter rows; ` .loc[] ` can do both!
1095
+ For example, let's say we wanted only to select the
1096
+ columns ` most_at_home ` and ` most_at_work ` . We could then use the ` .str.startswith ` method
1097
+ to choose only the columns that start with the word "most".
1098
+ The `str.startswith` expression returns an array of `True` or `False` values
1099
+ corresponding to the column names that start with the desired characters.
1062
1100
1063
1101
``` {code-cell} ipython3
1064
1102
tidy_lang.loc[:, tidy_lang.columns.str.startswith('most')]
@@ -1075,50 +1113,41 @@ the columns we want contain underscores and the others don't.
1075
1113
tidy_lang.loc[:, tidy_lang.columns.str.contains('_')]
1076
1114
```
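As a rough illustration of what the logical column selection does, the sketch below builds a small made-up data frame (hypothetical values, not the real `tidy_lang` data) and inspects the `True`/`False` mask before passing it to `loc[]`.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "language": ["English", "French"],
    "region": ["Toronto", "Toronto"],
    "most_at_home": [10, 2],
    "most_at_work": [12, 1],
})

# One True/False value per column name.
starts_with_most = toy.columns.str.startswith("most")
print(starts_with_most)

# `loc[]` keeps the columns whose mask entry is True.
most_columns = toy.loc[:, starts_with_most]
print(list(most_columns.columns))
```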
1077
1115
1078
- There are many different functions that help with selecting
1079
- variables based on certain criteria.
1080
- The additional resources section at the end of this chapter
1081
- provides a comprehensive resource on these functions.
1082
-
1083
- ``` {code-cell} ipython3
1084
- :tags: [remove-cell]
1085
-
1086
- # There are many different `select` helpers that select
1087
- # variables based on certain criteria.
1088
- # The additional resources section at the end of this chapter
1089
- # provides a comprehensive resource on `select` helpers.
1090
- ```
1091
-
1092
- ## Using ` iloc[] ` to extract a range of columns
1116
+ ## Using ` iloc[] ` to extract rows and columns by position
1093
1117
``` {index} pandas.DataFrame; iloc[], column range
1094
1118
```
1095
- Another approach for selecting columns is to use ` iloc[] ` ,
1096
- which provides the ability to index with integers rather than the names of the columns.
1097
- For example, the column names of the ` tidy_lang ` data frame are
1119
+ Another approach for selecting rows and columns is to use ` iloc[] ` ,
1120
+ which provides the ability to index with the position rather than the label of the columns.
1121
+ For example, the column labels of the ` tidy_lang ` data frame are
1098
1122
` ['category', 'language', 'region', 'most_at_home', 'most_at_work'] ` .
1099
1123
Using ` iloc[] ` , you can ask for the ` language ` column by requesting the
1100
1124
column at index ` 1 ` (remember that Python starts counting at ` 0 ` , so the second item ` 'language' `
1101
1125
has index ` 1 ` !).
1102
1126
1103
1127
``` {code-cell} ipython3
1104
- column = tidy_lang.iloc[:, 1]
1105
- column
1128
+ tidy_lang.iloc[:, 1]
1106
1129
```
1107
1130
1108
- You can also ask for multiple columns, just like we did with ` [] ` . We pass ` : ` before
1109
- the comma, indicating we want to retrieve all rows, and ` 1: ` after the comma
1131
+ You can also ask for multiple columns.
1132
+ We pass ` 1: ` after the comma
1110
1133
indicating we want columns after and including index 1 (* i.e.* ` language ` ).
1111
1134
1112
1135
``` {code-cell} ipython3
1113
- column_range = tidy_lang.iloc[:, 1:]
1114
- column_range
1136
+ tidy_lang.iloc[:, 1:]
1115
1137
```
1116
1138
1117
- The ` iloc[] ` method is less commonly used, and needs to be used with care.
1139
+ We can also use ` iloc[] ` to select ranges of rows, or simultaneously select ranges of rows and columns, using a similar syntax.
1140
+ For example, to select the first five rows and columns after and including index 1, we could use the following:
1141
+
1142
+ ``` {code-cell} ipython3
1143
+ tidy_lang.iloc[:5, 1:]
1144
+ ```
1145
+
1146
+ Note that the ` iloc[] ` method is not commonly used, and must be used with care.
1118
1147
For example, it is easy to
1119
1148
accidentally put in the wrong integer index! If you did not correctly remember
1120
1149
that the ` language ` column was index ` 1 ` , and used ` 2 ` instead, your code
1121
- would end up having a bug that might be quite hard to track down.
1150
+ might end up having a bug that is quite hard to track down.
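The following sketch, using a small made-up data frame (hypothetical values and variable names), shows why such a mistake can be hard to spot: the wrong position still returns a column without raising an error.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "category": ["Official", "Official"],
    "language": ["English", "French"],
    "region": ["Toronto", "Halifax"],
})

by_position = toy.iloc[:, 1]       # position 1 is `language`
off_by_one = toy.iloc[:, 2]        # runs without error, but returns `region`
by_label = toy.loc[:, "language"]  # the label states the intent explicitly

print(by_position.name, off_by_one.name, by_label.name)
```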
1122
1151
1123
1152
``` {index} pandas.Series; str.startswith
1124
1153
```
@@ -1247,52 +1276,44 @@ summary statistics that you can compute with `pandas`.
1247
1276
What if you want to calculate summary statistics on an entire data frame? Well,
1248
1277
it turns out that the functions in {numref}` tab:basic-summary-statistics `
1249
1278
can be applied to a whole data frame!
1250
- For example, we can ask for the number of rows that each column has using ` count ` .
1251
- ``` {code-cell} ipython3
1252
- region_lang.count()
1253
- ```
1254
- Not surprisingly, they are all the same. We could also ask for the ` mean ` , but
1255
- some of the columns in ` region_lang ` contain string data with words like ` "Vancouver" `
1256
- and ` "Halifax" ` ---for these columns there is no way for ` pandas ` to compute the mean.
1257
- So we provide the keyword ` numeric_only=True ` so that it only computes the mean of columns with numeric values. This
1258
- is also needed if you want the ` sum ` or ` std ` .
1259
- ``` {code-cell} ipython3
1260
- region_lang.mean(numeric_only=True)
1261
- ```
1262
- If we ask for the ` min ` or the ` max ` , ` pandas ` will give you the smallest or largest number
1263
- for columns with numeric values. For columns with text, it will return the
1264
- least repeated value for ` min ` and the most repeated value for ` max ` . Again,
1265
- if you only want the minimum and maximum value for
1266
- numeric columns, you can provide ` numeric_only=True ` .
1279
+ For example, we can ask for the maximum value of each column using `max`.
1280
+
1267
1281
``` {code-cell} ipython3
1268
1282
region_lang.max()
1269
1283
```
1284
+
1285
+ We can see that for columns that contain string data
1286
+ with words like ` "Vancouver" ` and ` "Halifax" ` ,
1287
+ the maximum value is determined by sorting the strings alphabetically
1288
+ and returning the last value.
1289
+ If we only want the maximum value for
1290
+ numeric columns,
1291
+ we can provide ` numeric_only=True ` :
1292
+
1270
1293
``` {code-cell} ipython3
1271
- region_lang.min( )
1294
+ region_lang.max(numeric_only=True )
1272
1295
```
1273
1296
1274
- Similarly, if there are only some columns for which you would like to get summary statistics,
1275
- you can first use ` loc[] ` and then ask for the summary statistic. An example of this is illustrated in {numref}` fig:summarize-across ` .
1276
- Later, we will talk about how you can also use a more general function, ` apply ` , to accomplish this.
1297
+ We could also ask for the `mean` of each column in the data frame.
1298
+ It does not make sense to compute the mean of the string columns,
1299
+ so in this case we * must* provide the keyword ` numeric_only=True `
1300
+ so that the mean is only computed on columns with numeric values.
1277
1301
1278
- ``` {figure} img/summarize/summarize.003.jpeg
1279
- :name: fig:summarize-across
1280
- :figclass: figure
1281
-
1282
- `loc[]` or `apply` is useful for efficiently calculating summary statistics on
1283
- many columns at once. The darker, top row of each table represents the column
1284
- headers.
1302
+ ``` {code-cell} ipython3
1303
+ region_lang.mean(numeric_only=True)
1285
1304
```
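The behaviour of `max`, `mean`, and `numeric_only=True` can be tried on a small made-up data frame. This is a sketch with hypothetical values, not the real `region_lang` data.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "region": ["Vancouver", "Halifax", "Toronto"],
    "most_at_home": [10, 30, 20],
})

# For the string column, the maximum is the last value in alphabetical order.
overall_max = toy.max()
print(overall_max)

# Restrict the summaries to numeric columns only.
numeric_max = toy.max(numeric_only=True)
numeric_mean = toy.mean(numeric_only=True)
print(numeric_max)
print(numeric_mean)
```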
1286
1305
1287
- Lets say that we want to know
1288
- the mean and standard deviation of all of the columns between ` "mother_tongue" ` and ` "lang_known" ` .
1289
- We use ` loc[] ` to specify the columns and then ` agg ` to ask for both the ` mean ` and ` std ` .
1306
+ If there are only some columns for which you would like to get summary statistics,
1307
+ you can first use ` [] ` or ` .loc[] ` to select those columns,
1308
+ and then ask for the summary statistic
1309
+ as we did for a single column previously.
1310
+ For example, if we want to know
1311
+ the mean and standard deviation of all of the columns between ` "mother_tongue" ` and ` "lang_known" ` ,
1312
+ we use ` .loc[] ` to select those columns and then ` agg ` to ask for both the ` mean ` and ` std ` .
1290
1313
``` {code-cell} ipython3
1291
1314
region_lang.loc[:, "mother_tongue":"lang_known"].agg(["mean", "std"])
1292
1315
```
1293
1316
1294
-
1295
-
1296
1317
## Performing operations on groups of rows using ` groupby `
1297
1318
1298
1319
+++
@@ -1330,56 +1351,89 @@ The `groupby` function takes at least one argument—the columns to use in t
1330
1351
grouping. Here we use only one column for grouping (` region ` ).
1331
1352
1332
1353
``` {code-cell} ipython3
1333
- region_lang.groupby("region")["most_at_home"].agg(["min", "max"])
1354
+ region_lang.groupby("region")
1334
1355
```
1335
1356
1336
1357
Notice that ` groupby ` converts a ` DataFrame ` object to a ` DataFrameGroupBy `
1337
1358
object, which contains information about the groups of the data frame. We can
1338
- then apply aggregating functions to the ` DataFrameGroupBy ` object. This can be handy if you would like to perform multiple operations and assign
1339
- each output to its own object.
1359
+ then apply aggregating functions to the ` DataFrameGroupBy ` object. Here we first
1360
+ select the ` most_at_home ` column, and then summarize the grouped data by their
1361
+ minimum and maximum values using ` agg ` .
1362
+
1340
1363
``` {code-cell} ipython3
1341
- region_lang.groupby("region")
1364
+ region_lang.groupby("region")["most_at_home"].agg(["min", "max"])
1342
1365
```
1343
1366
1367
+ The resulting dataframe has ` region ` as an index name.
1368
+ This is similar to what happened when we used the ` pivot ` function
1369
+ in the section on {ref}` pivot-wider ` ;
1370
+ and just as we did then,
1371
+ you can use ` reset_index ` to get back to a regular dataframe
1372
+ with ` region ` as a column name.
1373
+
1374
+ ``` {code-cell} ipython3
1375
+ region_lang.groupby("region")["most_at_home"].agg(["min", "max"]).reset_index()
1376
+ ```
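Here is a minimal sketch of the same pattern on a small made-up data frame (hypothetical values), checking where `region` ends up before and after `reset_index`.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
toy = pd.DataFrame({
    "region": ["Toronto", "Toronto", "Halifax", "Halifax"],
    "language": ["English", "French", "English", "French"],
    "most_at_home": [100, 5, 40, 3],
})

summary = toy.groupby("region")["most_at_home"].agg(["min", "max"])
print(summary.index.name)    # the grouping column is now the index

flat = summary.reset_index()
print(list(flat.columns))    # `region` is a regular column again
```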
1344
1377
You can also pass multiple column names to ` groupby ` . For example, if we wanted to
1345
1378
know about how the different categories of languages (Aboriginal, Non-Official &
1346
1379
Non-Aboriginal, and Official) are spoken at home in different regions, we would pass a
1347
1380
list including ` region ` and ` category ` to ` groupby ` .
1381
+
1348
1382
``` {code-cell} ipython3
1349
1383
region_lang.groupby(["region", "category"])["most_at_home"].agg(["min", "max"])
1350
1384
```
1351
1385
1352
- You can also ask for grouped summary statistics on the whole data frame
1386
+ You can also ask for grouped summary statistics on the whole data frame.
1387
+
1353
1388
``` {code-cell} ipython3
1354
1389
:tags: ["output_scroll"]
1355
1390
region_lang.groupby("region").agg(["min", "max"])
1356
1391
```
1357
1392
1358
1393
If you want to ask for only some columns, for example
1359
1394
the columns between ` "most_at_home" ` and ` "lang_known" ` ,
1360
- you might think about first applying ` groupby ` and then ` loc ` ;
1395
+ you might think about first applying ` groupby ` and then ` ["most_at_home":"lang_known"] ` ;
1361
1396
but ` groupby ` returns a ` DataFrameGroupBy ` object, which does not
1362
- work with ` loc ` . The other option is to do things the other way around:
1363
- first use ` loc ` , then use ` groupby ` .
1364
- This usually does work, but you have to be careful! For example,
1365
- in our case, if we try using ` loc ` and then ` groupby ` , we get an error.
1397
+ work with ranges inside ` [] ` .
1398
+ The other option is to do things the other way around:
1399
+ first select the column range with `.loc[:, "most_at_home":"lang_known"]` (a range inside plain `[]` selects rows, not columns), then use `groupby`.
1400
+ This can work, but you have to be careful! For example,
1401
+ in our case, we get an error.
1402
+
1366
1403
``` {code-cell} ipython3
1367
1404
:tags: [remove-output]
1368
- region_lang.loc[:, "most_at_home":"lang_known"].groupby("region").max()
1405
+ region_lang.loc[:, "most_at_home":"lang_known"].groupby("region").max()
1369
1406
```
1407
+
1370
1408
```
1371
1409
KeyError: 'region'
1372
1410
```
1373
- This is because when we use ` loc ` we selected only the columns between
1411
+
1412
+ This is because when we use `.loc[]` we selected only the columns between
1374
1413
` "most_at_home" ` and ` "lang_known" ` , which doesn't include ` "region" ` !
1375
- Instead, we need to call ` loc ` with a list of column names that
1376
- includes ` region ` , and then use ` groupby ` .
1414
+ Instead, we need to use ` groupby ` first
1415
+ and then call `[]` with a list of the column names we want to summarize;
1416
+ this approach always works.
1417
+
1418
+ ``` {code-cell} ipython3
1419
+ :tags: ["output_scroll"]
1420
+ region_lang.groupby("region")[["most_at_home", "most_at_work", "lang_known"]].max()
1421
+ ```
1422
+
1423
+ To see how many observations there are in each group,
1424
+ we can use ` value_counts ` .
1425
+
1426
+ ``` {code-cell} ipython3
1427
+ :tags: ["output_scroll"]
1428
+ region_lang.value_counts("region")
1429
+ ```
1430
+
1431
+ The `value_counts` function also takes a `normalize` parameter to show the output as proportions
1432
+ instead of counts.
1433
+
1377
1434
``` {code-cell} ipython3
1378
1435
:tags: ["output_scroll"]
1379
- region_lang.loc[
1380
- :,
1381
- ["region", "mother_tongue", "most_at_home", "most_at_work", "lang_known"]
1382
- ].groupby("region").max()
1436
+ region_lang.value_counts("region", normalize=True)
1383
1437
```
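As a quick check of how counts and proportions relate, the sketch below uses a small made-up data frame (hypothetical values) with four rows.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration.
toy = pd.DataFrame({"region": ["Toronto", "Toronto", "Toronto", "Halifax"]})

counts = toy.value_counts("region")
proportions = toy.value_counts("region", normalize=True)

print(counts)
print(proportions)
```

The proportions are the counts divided by the total number of rows, so they sum to 1.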
1384
1438
1385
1439
+++