@@ -55,57 +55,19 @@ By the end of the chapter, readers will be able to do the following:
  - `.str.split`
- Recall and use the following operators for their
  intended data wrangling tasks:
  - `==`
  - `in`
  - `and`
  - `or`
-   - `df[]`
+   - `[]`
  - `.iloc[]`
  - `.loc[]`

- ```{code-cell} ipython3
- ---
- jupyter:
-   source_hidden: true
- tags: [remove-cell]
- ---
- # By the end of the chapter, readers will be able to do the following:
-
- # - Define the term "tidy data".
- # - Discuss the advantages of storing data in a tidy data format.
- # - Define what vectors, lists, and data frames are in R, and describe how they relate to
- #   each other.
- # - Describe the common types of data in R and their uses.
- # - Recall and use the following functions for their
- #   intended data wrangling tasks:
- #   - `across`
- #   - `c`
- #   - `filter`
- #   - `group_by`
- #   - `select`
- #   - `map`
- #   - `mutate`
- #   - `pull`
- #   - `pivot_longer`
- #   - `pivot_wider`
- #   - `rowwise`
- #   - `separate`
- #   - `summarize`
- # - Recall and use the following operators for their
- #   intended data wrangling tasks:
- #   - `==`
- #   - `%in%`
- #   - `!`
- #   - `&`
- #   - `|`
- #   - `|>` and `%>%`
- ```
-
## Data frames, series, and lists

In Chapters {ref}`intro` and {ref}`reading`, *data frames* were the focus:
we learned how to import data into Python as a data frame, and perform basic operations on data frames in Python.
In the remainder of this book, this pattern continues. The vast majority of tools we use will require
that data are represented as a `pandas` **data frame** in Python. Therefore, in this section,
we will dig more deeply into what data frames are and how they are represented in Python.
This knowledge will be helpful in effectively utilizing these objects in our data analyses.
@@ -152,45 +114,29 @@ data set. There are 13 entities in the data set in total, corresponding to the
A data frame storing data regarding the population of various regions in Canada. In this example data frame, the row that corresponds to the observation for the city of Vancouver is colored yellow, and the column that corresponds to the population variable is colored blue.
```

- ```{code-cell} ipython3
- :tags: [remove-cell]
-
- # The following cell was removed because there is no "vector" in Python.
- ```
-
- +++ {"tags": ["remove-cell"]}
-
- Python stores the columns of a data frame as either
- *lists* or *vectors*. For example, the data frame in Figure
- {numref}`fig:02-vectors` has three vectors whose names are `region`, `year` and
- `population`. The next two sections will explain what lists and vectors are.
-
- ```{figure} img/data_frame_slides_cdn/data_frame_slides_cdn.005.jpeg
- :name: fig:02-vectors
- :figclass: caption-hack
-
- Data frame with three vectors.
- ```
-
- +++
-
### What is a series?

```{index} pandas.Series
```

- In Python, `pandas` **series** are arrays with labels. They are strictly 1-dimensional and can contain any data type (integers, strings, floats, etc), including a mix of them (objects);
- Python has several different basic data types, as shown in {numref}`tab:datatype-table`.
- You can create a `pandas` series using the `pd.Series()` function. For
- example, to create the vector `region` as shown in
- {numref}`fig:02-series`, you can write:
+ In Python, `pandas` **series** are list-like objects: they are ordered and can
+ be indexed. They are strictly 1-dimensional and can contain any data type
+ (integers, strings, floats, etc.), including a mix of them; Python
+ has several different basic data types, as shown in
+ {numref}`tab:datatype-table`.
+ You can create a `pandas` series using the
+ `pd.Series()` function. For example, to create the series `region` as shown
+ in {numref}`fig:02-series`, you can write:

```{code-cell} ipython3
import pandas as pd
+
region = pd.Series(["Toronto", "Montreal", "Vancouver", "Calgary", "Ottawa"])
region
```
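+
+ As a quick sketch of what "ordered and indexed" means, a series carries a set
+ of labels, which by default are the integer positions starting at 0:
+
+ ```{code-cell} ipython3
+ # the default labels are the positions 0 through 4;
+ # asking for label 2 returns the third element, "Vancouver"
+ region[2]
+ ```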

+ **(FIGURE 14 NEEDS UPDATING: (a) ZERO-BASED INDEXING, (b) TYPE SHOULD BE STRING (NOT CHARACTER))**
+

+++ {"tags": []}

```{figure} img/wrangling/pandas_dataframe_series.png
@@ -200,41 +146,6 @@ region
Example of a `pandas` series whose type is string.
```

- +++ {"tags": ["remove-cell"]}
-
- ### What is a vector?
-
- In R, **vectors** \index{vector}\index{atomic vector|see{vector}} are objects that can contain one or more elements. The vector
- elements are ordered, and they must all be of the same **data type**;
- R has several different basic data types, as shown in {numref}`tab:datatype-table`.
- Figure \@ref(fig:02-vector) provides an example of a vector where all of the elements are
- of character type.
- You can create vectors in R using the `c` function \index{c function} (`c` stands for "concatenate"). For
- example, to create the vector `region` as shown in Figure
- \@ref(fig:02-vector), you would write:
-
- ```{r}
- year <- c("Toronto", "Montreal", "Vancouver", "Calgary", "Ottawa")
- year
- ```
-
- > **Note:** Technically, these objects are called "atomic vectors." In this book
- > we have chosen to call them "vectors," which is how they are most commonly
- > referred to in the R community. To be totally precise, "vector" is an umbrella term that
- > encompasses both atomic vector and list objects in R. But this creates a
- > confusing situation where the term "vector" could
- > mean "atomic vector" *or* "the umbrella term for atomic vector and list,"
- > depending on context. Very confusing indeed! So to keep things simple, in
- > this book we *always* use the term "vector" to refer to "atomic vector."
- > We encourage readers who are enthusiastic to learn more to read the
- > Vectors chapter of *Advanced R* [@wickham2019advanced].
-
- ```{r 02-vector, echo = FALSE, message = FALSE, warning = FALSE, fig.cap = "Example of a vector whose type is character.", fig.retina = 2, out.width = "100%"}
- image_read("img/data_frame_slides_cdn/data_frame_slides_cdn.007.jpeg") %>%
-   image_crop("3632x590")
- ```
-
- +++
-

```{code-cell} ipython3
:tags: [remove-cell]
@@ -265,76 +176,36 @@ image_read("img/data_frame_slides_cdn/data_frame_slides_cdn.007.jpeg") %>%

```{table} Basic data types in Python
:name: tab:datatype-table
- | English name          | Type name  | Type Category  | Description                                   | Example                                    |
- | :-------------------- | :--------- | :------------- | :-------------------------------------------- | :----------------------------------------- |
- | integer               | `int`      | Numeric Type   | positive/negative whole numbers               | `42`                                       |
- | floating point number | `float`    | Numeric Type   | real number in decimal form                   | `3.14159`                                  |
- | boolean               | `bool`     | Boolean Values | true or false                                 | `True`                                     |
- | string                | `str`      | Sequence Type  | text                                          | `"Can I have a cheezburger?"`              |
- | list                  | `list`     | Sequence Type  | a collection of objects - mutable & ordered   | `['Ali', 'Xinyi', 'Miriam']`               |
- | tuple                 | `tuple`    | Sequence Type  | a collection of objects - immutable & ordered | `('Thursday', 6, 9, 2018)`                 |
- | dictionary            | `dict`     | Mapping Type   | mapping of key-value pairs                    | `{'name':'DSCI', 'code':100, 'credits':2}` |
- | none                  | `NoneType` | Null Object    | represents no value                           | `None`                                     |
+ | Data type             | Abbreviation | Description                     | Example                       |
+ | :-------------------- | :----------- | :------------------------------ | :---------------------------- |
+ | integer               | `int`        | positive/negative whole numbers | `42`                          |
+ | floating point number | `float`      | real number in decimal form     | `3.14159`                     |
+ | boolean               | `bool`       | true or false                   | `True`                        |
+ | string                | `str`        | text                            | `"Can I have a cheezburger?"` |
+ | none                  | `NoneType`   | represents no value             | `None`                        |
```
187
280
188
+++
281
189
282
- It is important in Python to make sure you represent your data with the correct type.
283
- Many of the ` pandas ` functions we use in this book treat
190
+ It is important in Python to make sure you represent your data with the correct type.
191
+ Many of the ` pandas ` functions we use in this book treat
284
192
the various data types differently. You should use integers and float types
285
193
(which both fall under the "numeric" umbrella type) to represent numbers and perform
286
194
arithmetic. Strings are used to represent data that should
287
- be thought of as "text", such as words, names, paths, URLs, and more.
195
+ be thought of as "text", such as words, names, paths, URLs, and more.
288
196
There are other basic data types in Python, such as * set*
289
197
and * complex* , but we do not use these in this textbook.
290
198
291
- ``` {code-cell} ipython3
292
- :tags: [remove-cell]
293
-
294
- # It is important in R to make sure you represent your data with the correct type.
295
- # Many of the `tidyverse` functions we use in this book treat
296
- # the various data types differently. You should use integers and double types
297
- # (which both fall under the "numeric" umbrella type) to represent numbers and perform
298
- # arithmetic. Doubles are more common than integers in R, though; for instance, a double data type is the
299
- # default when you create a vector of numbers using `c()`, and when you read in
300
- # whole numbers via `read_csv`. Characters are used to represent data that should
301
- # be thought of as "text", such as words, names, paths, URLs, and more. Factors help us
302
- # encode variables that represent *categories*; a factor variable takes one of a discrete
303
- # set of values known as *levels* (one for each category). The levels can be ordered or unordered. Even though
304
- # factors can sometimes *look* like characters, they are not used to represent
305
- # text, words, names, and paths in the way that characters are; in fact, R
306
- # internally stores factors using integers! There are other basic data types in R, such as *raw*
307
- # and *complex*, but we do not use these in this textbook.
308
- ```
309
-
310
199
### What is a list?
311
200
312
201
``` {index} list
313
202
```
314
203
315
204
Lists are built-in objects in Python that have multiple, ordered elements.
316
- ` pandas ` series can be treated as lists with labels (indices).
205
+ ` pandas ` series can be treated as an array with labels (indices).
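+
+ As a small sketch, the following builds a plain Python list (which may mix
+ data types) and then wraps it in a series, which attaches the default
+ integer labels; the values are made up purely for illustration:
+
+ ```{code-cell} ipython3
+ # a built-in list mixing a string, an integer, and a float
+ my_list = ["Vancouver", 2016, 631486.0]
+ # converting it to a series attaches the labels 0, 1, and 2
+ pd.Series(my_list)
+ ```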

- ```{code-cell} ipython3
- :tags: [remove-cell]
-
- # Lists \index{list} are also objects in R that have multiple, ordered elements.
- # Vectors and lists differ by the requirement of element type
- # consistency. All elements within a single vector must be of the same type (e.g.,
- # all elements are characters), whereas elements within a single list can be of
- # different types (e.g., characters, integers, logicals, and even other lists).
- ```
-
- +++ {"tags": ["remove-cell"]}
-
- ```{figure} img/data_frame_slides_cdn/data_frame_slides_cdn.008.jpeg
- :name: fig:02-vec-vs-list
- :figclass: caption-hack
-
- A vector versus a list.
- ```
-
- +++
-
+ **(FIGURE 3.4 FROM THE R-BOOK IS MISSING)**
+

### What does this have to do with data frames?
@@ -345,10 +216,10 @@ A vector versus a list.

A data frame is really just a collection of series stuck together that follows two rules:

1. Each element itself is a series.
2. Each element (series) must have the same length.

Not all columns in a data frame need to be of the same type.
{numref}`fig:02-dataframe` shows a data frame where
the columns are series of different types.
@@ -361,23 +232,6 @@ the columns are series of different types.
Data frame and vector types.
```

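+ To make the two rules concrete, here is a small sketch that builds a data
+ frame by sticking three equal-length series together; the values are
+ invented for illustration:
+
+ ```{code-cell} ipython3
+ # one string column and two numeric columns, all the same length
+ pd.DataFrame({
+     "region": ["Toronto", "Montreal", "Vancouver"],
+     "year": [2016, 2016, 2016],
+     "population": [2731571, 1704694, 631486],
+ })
+ ```
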
- ```{code-cell} ipython3
- :tags: [remove-cell]
-
- # A data frame \index{data frame!definition} is really a special kind of list that follows two rules:
-
- # 1. Each element itself must either be a vector or a list.
- # 2. Each element (vector or list) must have the same length.
-
- # Not all columns in a data frame need to be of the same type.
- # Figure \@ref(fig:02-dataframe) shows a data frame where
- # the columns are vectors of different types.
- # But remember: because the columns in this example are *vectors*,
- # the elements must be the same data type *within each column.*
- # On the other hand, if our data frame had *list* columns, there would be no such requirement.
- # It is generally much more common to use *vector* columns, though,
- # as the values for a single variable are usually all of the same type.
- ```

```{index} type
```
@@ -386,49 +240,29 @@ Data frame and vector types.
> For example we can check the class of the Canadian languages data set,
> `can_lang`, we worked with in the previous chapters and we see it is a `pandas.core.frame.DataFrame`.

- ```{code-cell} ipython3
- :tags: [remove-cell]
-
- # The functions from the `tidyverse` package that we use often give us a
- # special class of data frame called a *tibble*. Tibbles have some additional \index{tibble}
- # features and benefits over the built-in data frame object. These include the
- # ability to add useful attributes (such as grouping, which we will discuss later)
- # and more predictable type preservation when subsetting.
- # Because a tibble is just a data frame with some added features,
- # we will collectively refer to both built-in R data frames and
- # tibbles as data frames in this book.
-
- # > **Note:** You can use the function `class` \index{class} on a data object to assess whether a data
- # > frame is a built-in R data frame or a tibble. If the data object is a data
- # > frame, `class` will return `"data.frame"`. If the data object is a
- # > tibble it will return `"tbl_df" "tbl" "data.frame"`. You can easily convert
- # > built-in R data frames to tibbles using the `tidyverse` `as_tibble` function.
- # > For example we can check the class of the Canadian languages data set,
- # > `can_lang`, we worked with in the previous chapters and we see it is a tibble.
- ```

```{code-cell} ipython3
can_lang = pd.read_csv("data/can_lang.csv")
type(can_lang)
```

Lists, Series and DataFrames are basic types of *data structure* in Python, which
are core to most data analyses. We summarize them in
{numref}`tab:datastructure-table`. There are several other data structures in the Python programming
language (*e.g.,* matrices), but these are beyond the scope of this book.

+++

- ```{table} Basic data structures in Python
+ <!-- ```{table} Basic data structures in Python
:name: tab:datastructure-table
| Data Structure | Description |
| --- |------------ |
- | list | An 1D ordered collection of values that can store multiple data types at once. |
- | Series | An 1D ordered collection of values *with labels* that can store multiple data types at once. |
+ | list | A 1D ordered collection of values that can store multiple data types at once. |
+ | Series | A 1D ordered collection of values *with labels* that can store multiple data types at once. |
| DataFrame | A 2D labeled data structure with columns of potentially different types. |
```

- +++
+ +++ -->

## Tidy data
@@ -437,15 +271,15 @@ language (*e.g.,* matrices), but these are beyond the scope of this book.

There are many ways a tabular data set can be organized. This chapter will focus
on introducing the **tidy data** format of organization and how to make your raw
(and likely messy) data tidy. A tidy data frame satisfies
the following three criteria {cite:p}`wickham2014tidy`:

- each row is a single observation,
- each column is a single variable, and
- each value is a single cell (i.e., its entry in the data
  frame is not shared with another value).

{numref}`fig:02-tidy-image` demonstrates a tidy data set that satisfies these
three criteria.

+++ {"tags": []}
@@ -464,8 +298,8 @@ Tidy data satisfies three criteria.

There are many good reasons for making sure your data are tidy as a first step in your analysis.
The most important is that it is a single, consistent format that nearly every function
in `pandas` recognizes. No matter what the variables and observations
in your data represent, as long as the data frame
is tidy, you can manipulate it, plot it, and analyze it using the same tools.
If your data is *not* tidy, you will have to write special bespoke code
in your analysis that will not only be error-prone, but hard for others to understand.
@@ -491,18 +325,18 @@ below!
```{index} pandas.DataFrame; melt
```

One task that is commonly performed to get data into a tidy format
is to combine values that are stored in separate columns,
but are really part of the same variable, into one.
Data is often stored this way
because this format is sometimes more intuitive for human readability
and understanding, and humans create data sets.
In {numref}`fig:02-wide-to-long`,
the table on the left is in an untidy, "wide" format because the year values
(2006, 2011, 2016) are stored as column names.
And as a consequence,
the values for population for the various cities
over these years are also split across several columns.

For humans, this table is easy to read, which is why you will often find data
stored in this wide format. However, this format is difficult to work with
@@ -518,13 +352,16 @@ greatly simplified once the data is tidied.

Another problem with data in this format is that we don't know what the
numbers under each year actually represent. Do those numbers represent
population size? Land area? It's not clear.
To solve both of these problems,
we can reshape this data set to a tidy data format
by creating a column called "year" and a column called
"population." This transformation&mdash;which makes the data
"longer"&mdash;is shown as the right table in
{numref}`fig:02-wide-to-long`. Note that the number of entries in our data frame
can change in this transformation. The "untidy" data has 5 rows and 3 columns, for
a total of 15 entries, whereas the "tidy" data on the right has 15 rows and 2 columns,
for a total of 30 entries.

+++ {"tags": []}

@@ -541,41 +378,42 @@ Melting data from a wide to long data format.
```

We can achieve this effect in Python using the `.melt` function from the `pandas` package.
- The `.melt` function combines columns,
- and is usually used during tidying data
- when we need to make the data frame longer and narrower.
+ We say that we "melt" (or "pivot") the wide table into a longer format.
+ The `.melt` function combines columns,
+ and is usually used during data tidying
+ when we need to make the data frame longer and narrower.
To learn how to use `.melt`, we will work through an example with the
`region_lang_top5_cities_wide.csv` data set. This data set contains the
counts of how many Canadians cited each language as their mother tongue for five
major Canadian cities (Toronto, Montréal, Vancouver, Calgary and Edmonton) from
the 2016 Canadian census.
To get started,
we will use `pd.read_csv` to load the (untidy) data.

```{code-cell} ipython3
lang_wide = pd.read_csv("data/region_lang_top5_cities_wide.csv")
lang_wide
```

What is wrong with the untidy format above?
The table on the left in {numref}`fig:img-pivot-longer-with-table`
represents the data in the "wide" (messy) format.
From a data analysis perspective, this format is not ideal because the values of
the variable *region* (Toronto, Montréal, Vancouver, Calgary and Edmonton)
are stored as column names. Thus they
are not easily accessible to the data analysis functions we will apply
to our data set. Additionally, the *mother tongue* variable values are
spread across multiple columns, which will prevent us from doing any desired
visualization or statistical tasks until we combine them into one column. For
instance, suppose we want to know the languages with the highest number of
Canadians reporting it as their mother tongue among all five regions. This
question would be tough to answer with the data in its current format.
We *could* find the answer with the data in this format,
though it would be much easier to answer if we tidy our
data first. If mother tongue were instead stored as one column,
as shown in the tidy data on the right in
{numref}`fig:img-pivot-longer-with-table`,
we could simply use one line of code (`df["mother_tongue"].max()`)
to get the maximum value.

+++ {"tags": []}
@@ -589,7 +427,7 @@ Going from wide to long with the `.melt` function.

+++

{numref}`fig:img-pivot-longer` details the arguments that we need to specify
in the `.melt` function to accomplish this data transformation.

+++ {"tags": []}
@@ -613,25 +451,26 @@ We use `.melt` to combine the Toronto, Montréal,
Vancouver, Calgary, and Edmonton columns into a single column called `region`,
and create a column called `mother_tongue` that contains the count of how many
Canadians report each language as their mother tongue for each metropolitan
- area. We specify `value_vars` to be all
- the columns between Toronto and Edmonton:
+ area:

```{code-cell} ipython3
lang_mother_tidy = lang_wide.melt(
    id_vars=["category", "language"],
-     value_vars=["Toronto", "Montréal", "Vancouver", "Calgary", "Edmonton"],
    var_name="region",
    value_name="mother_tongue",
)

lang_mother_tidy
```

+ **(FIGURE 3.9 FROM THE R BOOK IS MISSING)**
+

> **Note:** In the code above, the call to the
> `.melt` function is split across several lines. This is allowed in
- > certain cases; for example, when calling a function as above, as long as the
- > line ends with a comma `,` Python knows to keep reading on the next line.
+ > certain cases; for example, when calling a function as above, the input
+ > arguments are between parentheses `()` and Python knows to keep reading on
+ > the next line. Each argument line ends with a comma `,`, making it easier to read.
> Splitting long lines like this across multiple lines is encouraged
> as it helps significantly with code readability. Generally speaking, you should
> limit each line of code to about 80 characters.
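+
+ A tiny sketch of that rule, with arbitrary values:
+
+ ```{code-cell} ipython3
+ # inside parentheses, Python keeps reading across line breaks;
+ # each argument line ends with a comma
+ total = sum([
+     1,
+     2,
+     3,
+ ])
+ total
+ ```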
@@ -656,17 +495,17 @@ been met:

Suppose we have observations spread across multiple rows rather than in a single
row. For example, in {numref}`fig:long-to-wide`, the table on the left is in an
untidy, long format because the `count` column contains three variables
(population, commuter, and incorporated count) and information about each observation
(here, population, commuter, and incorporated counts for a region) is split across three rows.
Remember: one of the criteria for tidy data
is that each observation must be in a single row.

Using data in this format&mdash;where two or more variables are mixed together
in a single column&mdash;makes it harder to apply many usual `pandas` functions.
For example, finding the maximum number of commuters
would require an additional step of filtering for the commuter values
before the maximum can be computed.
In comparison, if the data were tidy,
all we would have to do is compute the maximum value for the commuter column.
To reshape this untidy data set to a tidy (and in this case, wider) format,
we need to create columns called "population", "commuters", and "incorporated."
@@ -684,12 +523,12 @@ Going from long to wide data.
+++

To tidy this type of data in Python, we can use the `.pivot` function.
The `.pivot` function generally increases the number of columns (widens)
and decreases the number of rows in a data set.
To learn how to use `.pivot`,
we will work through an example
with the `region_lang_top5_cities_long.csv` data set.
This data set contains the number of Canadians reporting
the primary language at home and work for five
major cities (Toronto, Montréal, Vancouver, Calgary and Edmonton).

@@ -698,14 +537,14 @@ lang_long = pd.read_csv("data/region_lang_top5_cities_long.csv")

lang_long
```

What makes the data set shown above untidy?
In this example, each observation is a language in a region.
However, each observation is split across multiple rows:
one where the count for `most_at_home` is recorded,
and the other where the count for `most_at_work` is recorded.
Suppose the goal with this data was to
visualize the relationship between the number of
Canadians reporting their primary language at home and work.
Doing that would be difficult with this data in its current form,
since these two variables are stored in the same column.
{numref}`fig:img-pivot-wider-table` shows how this data
@@ -722,7 +561,7 @@ Going from long to wide with the `.pivot` function.

+++

{numref}`fig:img-pivot-wider` details the arguments that we need to specify
in the `.pivot` function.
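+
+ As a toy sketch of what `.pivot` does (the column names and counts below are
+ made up for illustration, not taken from our data set):
+
+ ```{code-cell} ipython3
+ # two rows per region become one row per region, with one column per "type"
+ toy = pd.DataFrame({
+     "region": ["Toronto", "Toronto", "Montréal", "Montréal"],
+     "type": ["most_at_home", "most_at_work", "most_at_home", "most_at_work"],
+     "count": [3, 4, 1, 2],
+ })
+ toy.pivot(index="region", columns="type", values="count")
+ ```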

+++ {"tags": []}
@@ -754,7 +593,7 @@ lang_home_tidy
```

```{code-cell} ipython3
- lang_home_tidy.dtypes
+ lang_home_tidy.info()
```

The data above is now tidy! We can go through the three criteria again to check
@@ -781,11 +620,11 @@ more columns, and we would see the data set "widen."
```{index} pandas.Series; str.split, delimiter
```

Data are also not considered tidy when multiple values are stored in the same
cell. The data set we show below is even messier than the ones we dealt with
above: the `Toronto`, `Montréal`, `Vancouver`, `Calgary` and `Edmonton` columns
contain the number of Canadians reporting their primary language at home and
work in one column separated by the delimiter (`/`). The column names are the
values of a variable, *and* each value does not have its own cell! To turn this
messy data into tidy data, we'll have to fix these issues.
@@ -795,28 +634,34 @@ lang_messy
```

First we’ll use `.melt` to create two columns, `region` and `value`,
similar to what we did previously.
The new `region` column will contain the region names,
and the new column `value` will be a temporary holding place for the
data that we need to further separate, i.e., the
number of Canadians reporting their primary language at home and work.

```{code-cell} ipython3
lang_messy_longer = lang_messy.melt(
    id_vars=["category", "language"],
-     value_vars=["Toronto", "Montréal", "Vancouver", "Calgary", "Edmonton"],
    var_name="region",
    value_name="value",
)

lang_messy_longer
```

- Next we'll use `.str.split` to split the `value` column into two columns.
- One column will contain only the counts of Canadians
- that speak each language most at home,
- and the other will contain the counts of Canadians
- that speak each language most at work for each region.
+ Next we'll use `.str.split` to split the `value` column into two columns.
+ How it works is that it takes a single string and splits it into multiple values
+ based on the character you tell it to split on. For example:
+
+ ```{code-cell} ipython3
+ "50/0".split("/")
+ ```
+
+ We can use this operation on the columns of our data frame so that
+ one column will contain only the counts of Canadians
+ that speak each language most at home,
+ and the other will contain the counts of Canadians
+ that speak each language most at work for each region.
{numref}`fig:img-separate`
outlines what we need to specify to use `.str.split`.
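+
+ As a hedged sketch of the idea (the book's exact call may differ), applying
+ the string-accessor version of `split` with `expand=True` returns the pieces
+ as separate columns:
+
+ ```{code-cell} ipython3
+ # split every entry of the value column at "/" and spread the
+ # pieces into two new columns (labeled 0 and 1)
+ lang_messy_longer["value"].str.split("/", expand=True)
+ ```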
@@ -843,7 +688,7 @@ tidy_lang
```

```{code-cell} ipython3
- tidy_lang.dtypes
+ tidy_lang.info()
```

Is this data set now tidy? If we recall the three criteria for tidy data:
@@ -856,17 +701,17 @@ We can see that this data now satisfies all three criteria, making it easier to

analyze. But we aren't done yet! Notice in the table, all of the variables are
"object" data types. Object data types are columns of strings or columns with
mixed types. In the previous example in Section {ref}`pivot-wider`, the
`most_at_home` and `most_at_work` variables were `int64` (integer)&mdash;you can
- verify this by calling `df.dtypes`&mdash;which is a type
+ verify this by calling `df.info()`&mdash;which is a type
of numeric data. This change is due to the delimiter (`/`) when we read in this
messy data set. Python read these columns in as string types, and by default,
`.str.split` will return columns as object data types.

It makes sense for `region`, `category`, and `language` to be stored as an
object type. However, suppose we want to apply any functions that treat the
`most_at_home` and `most_at_work` columns as a number (e.g., finding rows
above a numeric threshold of a column).
In that case,
it won't be possible if the variable is stored as an `object`.
Fortunately, the `pandas.to_numeric` function provides a natural way to fix problems
like this: it will convert the columns to the best numeric data types.
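+
+ As a minimal sketch of what `pd.to_numeric` does, using a throwaway series
+ rather than our data:
+
+ ```{code-cell} ipython3
+ # strings that look like numbers are converted to an integer data type
+ pd.to_numeric(pd.Series(["50", "0"]))
+ ```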
@@ -887,12 +732,12 @@ like this: it will convert the columns to the best numeric data types.

# It makes sense for `region`, `category`, and `language` to be stored as a
# character (or perhaps factor) type. However, suppose we want to apply any functions that treat the
# `most_at_home` and `most_at_work` columns as a number (e.g., finding rows
# above a numeric threshold of a column).
# In that case,
# it won't be possible to do if the variable is stored as a `character`.
# Fortunately, the `separate` function provides a natural way to fix problems
# like this: we can set `convert = TRUE` to convert the `most_at_home`
# and `most_at_work` columns to the correct data type.
```
@@ -903,126 +748,38 @@ tidy_lang
```

```{code-cell} ipython3
- tidy_lang.dtypes
+ tidy_lang.info()
```

Now we see the `most_at_home` and `most_at_work` columns are of `int64` data types,
indicating they are integer data types (i.e., numbers)!

+++

- (loc-iloc)=
- ## Using `.loc[]` and `.iloc[]` to extract a range of columns
-
- ```{index} pandas.DataFrame; loc[]
- ```
-
- Now that the `tidy_lang` data is indeed *tidy*, we can start manipulating it
- using the powerful suite of functions from `pandas`.
- For the first example, recall `.loc[]` from Chapter {ref}`intro`,
- which lets us create a subset of columns from a data frame.
- Suppose we wanted to select only the columns `language`, `region`,
- `most_at_home` and `most_at_work` from the `tidy_lang` data set. Using what we
- learned in Chapter {ref}`intro`, we would pass all of these column names into the square brackets:
-
- ```{code-cell} ipython3
- selected_columns = tidy_lang.loc[:, ["language", "region", "most_at_home", "most_at_work"]]
- selected_columns
- ```
-
- ```{index} pandas.DataFrame; iloc[], column range
- ```
-
- Here we wrote out the names of each of the columns. However, this method is
- time-consuming, especially if you have a lot of columns! Another approach is to
- index with integers. `.iloc[]` makes it easier for
- us to select columns. For instance, we can use `.iloc[]` to choose a
- range of columns rather than typing each column name out. To do this, we use the
- colon (`:`) operator to denote the range. For example, to get all the columns in
- the `tidy_lang` data frame from `language` to `most_at_work`, we pass `:` before the comma indicating we want to retrieve all rows, and `1:` after the comma indicating we want only columns from index 1 (*i.e.* `language`) and afterwards.
-
- ```{code-cell} ipython3
- :tags: [remove-cell]
-
- # Here we wrote out the names of each of the columns. However, this method is
- # time-consuming, especially if you have a lot of columns! Another approach is to
- # use a "select helper". Select helpers are operators that make it easier for
- # us to select columns. For instance, we can use a select helper to choose a
- # range of columns rather than typing each column name out. To do this, we use the
- # colon (`:`) operator to denote the range. For example, to get all the columns in \index{column range}
- # the `tidy_lang` data frame from `language` to `most_at_work` we pass
- # `language:most_at_work` as the second argument to the `select` function.
- ```
-
- ```{code-cell} ipython3
- column_range = tidy_lang.iloc[:, 1:]
- column_range
- ```
-
- Notice that we get the same output as we did above,
- but with less (and clearer!) code. This type of operator
- is especially handy for large data sets.
-
- ```{index} pandas.Series; str.startswith
- ```
-
- Suppose instead we wanted to extract columns that followed a particular pattern
- rather than just selecting a range. For example, let's say we wanted only to select the
- columns `most_at_home` and `most_at_work`. There are other functions that allow
- us to select variables based on their names. In particular, we can use the `.str.startswith` method
- to choose only the columns that start with the word "most":
-
- ```{code-cell} ipython3
- tidy_lang.loc[:, tidy_lang.columns.str.startswith('most')]
- ```
-
- ```{index} pandas.Series; str.contains
- ```
-
- We could also have chosen the columns containing an underscore `_` by using
- `.str.contains("_")`, since we notice
- the columns we want contain underscores and the others don't.
-
- ```{code-cell} ipython3
- tidy_lang.loc[:, tidy_lang.columns.str.contains('_')]
- ```
-
- There are many different functions that help with selecting
- variables based on certain criteria.
- The additional resources section at the end of this chapter
- provides a comprehensive resource on these functions.
-
- ```{code-cell} ipython3
- :tags: [remove-cell]
-
- # There are many different `select` helpers that select
- # variables based on certain criteria.
- # The additional resources section at the end of this chapter
- # provides a comprehensive resource on `select` helpers.
- ```
-
- ## Using `df[]` to extract rows
-
- Next, we revisit the `df[]` from Chapter {ref}`intro`,
- which lets us create a subset of rows from a data frame.
- Recall the argument to the `df[]`:
+ ## Using `[]` to extract rows or columns
+
+ Now that the `tidy_lang` data is indeed *tidy*, we can start manipulating it
+ using the powerful suite of functions from `pandas`.
+ We revisit `[]` from Chapter {ref}`intro`,
+ which lets us create a subset of rows from a data frame.
+ Recall the argument to `[]`:
column names or a logical statement evaluated to either `True` or `False`;
- `df[]` works by returning the rows where the logical statement evaluates to `True`.
- This section will highlight more advanced usage of the `df[]` function.
+ `[]` works by returning the rows where the logical statement evaluates to `True`.
+ This section will highlight more advanced usage of the `[]` operation.
In particular, this section provides an in-depth treatment of the variety of logical statements
- one can use in the `df[]` to select subsets of rows.
+ one can use in `[]` to select subsets of rows.

+++

### Extracting rows that have a certain value with `==`
Suppose we are only interested in the subset of rows in `tidy_lang` corresponding to the
official languages of Canada (English and French).
We can extract these rows by using the *equivalency operator* (`==`)
to compare the values of the `category` column
with the value `"Official languages"`.
- With these arguments, `df[]` returns a data frame with all the columns
+ With these arguments, `[]` returns a data frame with all the columns
of the input data frame
but only the rows we asked for in the logical statement, i.e.,
those where the `category` column holds the value `"Official languages"`.
We name this data frame `official_langs`.
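+
+ A sketch of the subsetting call just described, using the `official_langs`
+ name from the text (the book's own cell may differ slightly):
+
+ ```{code-cell} ipython3
+ official_langs = tidy_lang[tidy_lang["category"] == "Official languages"]
+ official_langs
+ ```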
@@ -1034,7 +791,7 @@ official_langs

### Extracting rows that do not have a certain value with `!=`

What if we want all the other language categories in the data set *except* for
those in the `"Official languages"` category? We can accomplish this with the `!=`
operator, which means "not equal to". So if we want to find all the rows
where the `category` does *not* equal `"Official languages"` we write the code
below.
@@ -1046,14 +803,14 @@ tidy_lang[tidy_lang["category"] != "Official languages"]

(filter-and)=
### Extracting rows satisfying multiple conditions using `&`

Suppose now we want to look at only the rows
for the French language in Montréal.
To do this, we need to filter the data set
to find rows that satisfy multiple conditions simultaneously.
We can do this with the ampersand symbol (`&`), which
is interpreted by Python as "and".
We write the code as shown below to filter the `tidy_lang` data frame
to subset the rows where `region == "Montréal"`
*and* the `language == "French"`.

```{code-cell} ipython3
@@ -1065,12 +822,12 @@ tidy_lang[(tidy_lang["region"] == "Montréal") & (tidy_lang["language"] == "Fren
### Extracting rows satisfying at least one condition using `|`

Suppose we were interested in only those rows corresponding to cities in Alberta
in the `official_langs` data set (Edmonton and Calgary).
We can't use `&` as we did above because `region`
cannot be both Edmonton *and* Calgary simultaneously.
Instead, we can use the vertical pipe (`|`) logical operator,
which gives us the cases where one condition *or*
another condition *or* both are satisfied.
In the code below, we ask Python to return the rows
where the `region` column is equal to "Calgary" *or* "Edmonton".

@@ -1082,20 +839,20 @@ official_langs[

### Extracting rows with values in a list using `.isin()`

Next, suppose we want to see the populations of our five cities.
Let's read in the `region_data.csv` file
that comes from the 2016 Canadian census,
as it contains statistics for number of households, land area, population
and number of dwellings for different regions.

```{code-cell} ipython3
region_data = pd.read_csv("data/region_data.csv")
region_data
```

To get the population of the five cities
we can filter the data set using the `.isin` method.
The `.isin` method is used to see if an element belongs to a list.
Here we are filtering for rows where the value in the `region` column
matches any of the five cities we are interested in: Toronto, Montréal,
Vancouver, Calgary, and Edmonton.
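+
+ A sketch of that filter, assuming a `city_names` list (the book's exact
+ cell may differ):
+
+ ```{code-cell} ipython3
+ city_names = ["Toronto", "Montréal", "Vancouver", "Calgary", "Edmonton"]
+ region_data[region_data["region"].isin(city_names)]
+ ```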
@@ -1136,7 +893,7 @@ pd.Series(["Vancouver", "Toronto"]).isin(pd.Series(["Toronto", "Vancouver"]))
# > elements in `vectorB`. Then the second element of `vectorA` is compared
# > to all the elements in `vectorB`, and so on. Notice the difference between `==` and
# > `%in%` in the example below.
# >
# >```{r}
# >c("Vancouver", "Toronto") == c("Toronto", "Vancouver")
# >c("Vancouver", "Toronto") %in% c("Toronto", "Vancouver")
@@ -1152,25 +909,135 @@ glue("census_popn", "{0:,.0f}".format(35151728))
glue("most_french", "{0:,.0f}".format(2669195))
```

We saw in Section {ref}`filter-and` that
{glue:text}`most_french` people reported
speaking French in Montréal as their primary language at home.
If we are interested in finding the official languages in regions
with higher numbers of people who speak it as their primary language at home
- compared to French in Montréal, then we can use `df[]` to obtain rows
+ compared to French in Montréal, then we can use `[]` to obtain rows
where the value of `most_at_home` is greater than
{glue:text}`most_french`.

```{code-cell} ipython3
official_langs[official_langs["most_at_home"] > 2669195]
```

This operation returns a data frame with only one row, indicating that when
considering the official languages,
only English in Toronto is reported by more people
as their primary language at home
than French in Montréal according to the 2016 Canadian census.

+ (loc-iloc)=
+ ## Using `.loc[]` to filter rows and select columns
+
+ ```{index} pandas.DataFrame; loc[]
+ ```
+
+ The `[]` operation is only used when you want to filter rows or select columns;
+ it cannot be used to do both operations at the same time. This is where `.loc[]`
+ comes in. For the first example, recall `.loc[]` from Chapter {ref}`intro`,
+ which lets us create a subset of columns from a data frame.
+ Suppose we wanted to select only the columns `language`, `region`,
+ `most_at_home` and `most_at_work` from the `tidy_lang` data set. Using what we
+ learned in Chapter {ref}`intro`, we would pass all of these column names into the square brackets:
+
+ ```{code-cell} ipython3
+ selected_columns = tidy_lang.loc[:, ["language", "region", "most_at_home", "most_at_work"]]
+ selected_columns
+ ```
+
+ We pass `:` before the comma indicating we want to retrieve all rows, and the list indicates
+ the columns that we want.
+
+ Note that we could obtain the same result by stating that we would like all of the columns
+ from `language` through `most_at_work`. Instead of passing a list of all of the column
+ names that we want, we can ask for the range of columns `"language":"most_at_work"`, which
+ you can read as "the columns from `language` to (`:`) `most_at_work`".
+
+ ```{code-cell} ipython3
+ selected_columns = tidy_lang.loc[:, "language":"most_at_work"]
+ selected_columns
+ ```
+
+ Similarly, you can ask for all of the columns including and after `language` by doing the following:
+
+ ```{code-cell} ipython3
+ selected_columns = tidy_lang.loc[:, "language":]
+ selected_columns
+ ```
+
+ By not putting anything after the `:`, Python reads this as "from `language` until the last column".
+ Although the notation for selecting a range using `:` is convenient because less code is required,
+ it must be used carefully. If you were to re-order columns or add a column to the data frame, the
+ output would change. Using a list is more explicit and less prone to potential confusion.
+
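+ Finally, since `.loc[]` takes a row selector before the comma and a column
+ selector after it, we can filter rows and select columns in one step. A
+ short sketch (any of the logical statements from earlier works here):
+
+ ```{code-cell} ipython3
+ # rows where language is French, and only two of the columns
+ tidy_lang.loc[tidy_lang["language"] == "French", ["region", "most_at_home"]]
+ ```
+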
+ Suppose instead we wanted to extract columns that followed a particular pattern
+ rather than just selecting a range. For example, let's say we wanted only to select the
+ columns `most_at_home` and `most_at_work`. There are other functions that allow
+ us to select variables based on their names. In particular, we can use the `.str.startswith` method
+ to choose only the columns that start with the word "most":
+
+ ```{code-cell} ipython3
+ tidy_lang.loc[:, tidy_lang.columns.str.startswith('most')]
+ ```
+
+ ```{index} pandas.Series; str.contains
+ ```
+
+ We could also have chosen the columns containing an underscore `_` by using
+ `.str.contains("_")`, since we notice
+ the columns we want contain underscores and the others don't.
+
+ ```{code-cell} ipython3
+ tidy_lang.loc[:, tidy_lang.columns.str.contains('_')]
+ ```
+
+ There are many different functions that help with selecting
+ variables based on certain criteria.
+ The additional resources section at the end of this chapter
+ provides a comprehensive resource on these functions.
+
+ ```{code-cell} ipython3
+ :tags: [remove-cell]
+
+ # There are many different `select` helpers that select
+ # variables based on certain criteria.
+ # The additional resources section at the end of this chapter
+ # provides a comprehensive resource on `select` helpers.
+ ```
+
+ ## Using `.iloc[]` to extract a range of columns
+
+ ```{index} pandas.DataFrame; iloc[], column range
+ ```
+
+ Another approach for selecting columns is to use `.iloc[]`,
+ which allows us to index with integers rather than the names of the columns.
+ For example, the column names of the `tidy_lang` data frame are
+ `['category', 'language', 'region', 'most_at_home', 'most_at_work']`.
+
+ Then using `.iloc[]` you can ask for the `language` column by doing
+
+ ```{code-cell} ipython3
+ column = tidy_lang.iloc[:, 1]
+ column
+ ```
+
+ You can also ask for multiple columns as we did with `[]`. We pass `:` before
+ the comma indicating we want to retrieve all rows, and `1:` after the comma
+ indicating we want only columns from index 1 (*i.e.* `language`) and afterwards.
+
+ ```{code-cell} ipython3
+ column_range = tidy_lang.iloc[:, 1:]
+ column_range
+ ```
+
+ This is less commonly used and needs to be used with care; it is easy to
+ accidentally put in the wrong integer because you don't remember whether
+ `language` is column number 1 or 2.
+
+ Notice that we get the same output as we did when selecting the same
+ columns by name with `.loc[]` above.
+
+ ```{index} pandas.Series; str.startswith
+ ```
+
+++ {"tags": []}

(pandas-assign)=
@@ -1180,28 +1047,27 @@ than French in Montréal according to the 2016 Canadian census.

### Using ` .assign ` to modify columns

- ``` {index} pandas.DataFrame; df []
+ ``` {index} pandas.DataFrame; []
 ```

- In Section {ref}` str-split ` ,
+ In Section {ref}` str-split ` ,
when we first read in the ` "region_lang_top5_cities_messy.csv" ` data,
- all of the variables were "object" data types.
- During the tidying process,
- we used the ` pandas.to_numeric ` function
- to convert the ` most_at_home ` and ` most_at_work ` columns
- to the desired integer (i.e., numeric class) data types and then used ` df []` to overwrite columns.
- But suppose we didn't use the ` df []` ,
+ all of the variables were "object" data types.
+ During the tidying process,
+ we used the ` pandas.to_numeric ` function
+ to convert the ` most_at_home ` and ` most_at_work ` columns
+ to the desired integer (i.e., numeric class) data types and then used ` [] ` to overwrite columns.
+ But suppose we didn't use ` [] ` ,
and needed to modify the columns some other way.
- Below we create such a situation
+ Below we create such a situation
so that we can demonstrate how to use ` .assign `
- to change the column types of a data frame.
+ to change the column types of a data frame.
` .assign ` is a useful method to modify or create new data frame columns.

``` {code-cell} ipython3
lang_messy = pd.read_csv("data/region_lang_top5_cities_messy.csv")
lang_messy_longer = lang_messy.melt(
    id_vars=["category", "language"],
-     value_vars=["Toronto", "Montréal", "Vancouver", "Calgary", "Edmonton"],
    var_name="region",
    value_name="value",
)
@@ -1219,23 +1085,23 @@ official_langs_obj
 ```

``` {code-cell} ipython3
- official_langs_obj.dtypes
+ official_langs_obj.info()
 ```

- To use the ` .assign ` method, again we first specify the object to be the data set,
- and in the following arguments,
- we specify the name of the column we want to modify or create
+ To use the ` .assign ` method, again we first specify the object to be the data set,
+ and in the following arguments,
+ we specify the name of the column we want to modify or create
(here ` most_at_home ` and ` most_at_work ` ), an ` = ` sign,
and then the function we want to apply (here ` pandas.to_numeric ` ).
- In the function we want to apply,
- we refer to the column upon which we want it to act
+ In the function we want to apply,
+ we refer to the column upon which we want it to act
(here ` most_at_home ` and ` most_at_work ` ).
In our example, we are naming the columns the same
- names as columns that already exist in the data frame
- ("most\_ at\_ home", "most\_ at\_ work")
- and this will cause ` .assign ` to * overwrite* those columns
+ names as columns that already exist in the data frame
+ ("most\_ at\_ home", "most\_ at\_ work")
+ and this will cause ` .assign ` to * overwrite* those columns
(also referred to as modifying those columns * in-place* ).
- If we were to give the columns a new name,
+ If we were to give the columns a new name,
then ` .assign ` would create new columns with the names we specified.
` .assign ` 's general syntax is detailed in {numref}` fig:img-assign ` .

@@ -1251,7 +1117,7 @@ Syntax for the `.assign` function.

+++

Below we use ` .assign ` to convert the columns ` most_at_home ` and ` most_at_work `
- to numeric data types in the ` official_langs ` data set as described in
+ to numeric data types in the ` official_langs ` data set as described in
{numref}` fig:img-assign ` :

``` {code-cell} ipython3
@@ -1264,7 +1130,7 @@ official_langs_numeric
 ```

``` {code-cell} ipython3
- official_langs_numeric.dtypes
+ official_langs_numeric.info()
 ```

Now we see that the ` most_at_home ` and ` most_at_work ` columns are both ` int64 ` (which is a numeric data type)!
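
Had we instead supplied new names to ` .assign ` , the converted values would
have been added as * new* columns, leaving the original "object" columns in
place. The sketch below is our own added illustration of that behavior (the
` _numeric ` column names are hypothetical):

``` {code-cell} ipython3
official_langs_obj.assign(
    most_at_home_numeric=pd.to_numeric(official_langs_obj["most_at_home"]),
    most_at_work_numeric=pd.to_numeric(official_langs_obj["most_at_work"]),
)
```
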
@@ -1297,26 +1163,26 @@ the 2016 Canadian census. What does this number mean to us? To understand this
number, we need context. In particular, how many people were in Toronto when
this data was collected? From the 2016 Canadian census profile, the population
of Toronto was reported to be
- {glue: text }` toronto_popn ` people.
- The number of people who report that English is their primary language at home
- is much more meaningful when we report it in this context.
- We can even go a step further and transform this count to a relative frequency
+ {glue: text }` toronto_popn ` people.
+ The number of people who report that English is their primary language at home
+ is much more meaningful when we report it in this context.
+ We can even go a step further and transform this count to a relative frequency
or proportion.
- We can do this by dividing the number of people reporting a given language
- as their primary language at home by the number of people who live in Toronto.
- For example,
- the proportion of people who reported that their primary language at home
+ We can do this by dividing the number of people reporting a given language
+ as their primary language at home by the number of people who live in Toronto.
+ For example,
+ the proportion of people who reported that their primary language at home
was English in the 2016 Canadian census was {glue: text }` prop_eng_tor `
in Toronto.

- Let's use ` .assign ` to create a new column in our data frame
- that holds the proportion of people who speak English
- for our five cities of focus in this chapter.
- To accomplish this, we will need to do two tasks
+ Let's use ` .assign ` to create a new column in our data frame
+ that holds the proportion of people who speak English
+ for our five cities of focus in this chapter.
+ To accomplish this, we will need to do two tasks
beforehand:

1. Create a list containing the population values for the cities.
- 2. Filter the ` official_langs ` data frame
+ 2. Filter the ` official_langs ` data frame
so that we only keep the rows where the language is English.

To create a list containing the population values for the five cities
@@ -1328,7 +1194,7 @@ city_pops = [5928040, 4098927, 2463431, 1392609, 1321426]
city_pops
 ```

- And next, we will filter the ` official_langs ` data frame
+ And next, we will filter the ` official_langs ` data frame
so that we only keep the rows where the language is English.
We will name the new data frame we get from this ` english_langs ` :

@@ -1337,8 +1203,8 @@ english_langs = official_langs[official_langs["language"] == "English"]
english_langs
 ```

- Finally, we can use ` .assign ` to create a new column,
- named ` most_at_home_proportion ` , that will have value that corresponds to
+ Finally, we can use ` .assign ` to create a new column,
+ named ` most_at_home_proportion ` , that will have a value that corresponds to
the proportion of people reporting English as their primary
language at home.
We will compute this by dividing the column by our vector of city populations.
@@ -1353,14 +1219,14 @@ english_langs

In the computation above, we had to ensure that we ordered the ` city_pops ` vector in the
same order as the cities were listed in the ` english_langs ` data frame.
- This is because Python will perform the division computation we did by dividing
- each element of the ` most_at_home ` column by each element of the
+ This is because Python will perform the division computation we did by dividing
+ each element of the ` most_at_home ` column by each element of the
` city_pops ` list, matching them up by position.
Failing to do this would have resulted in the incorrect math being performed.

- > ** Note:** In more advanced data wrangling,
- > one might solve this problem in a less error-prone way though using
- > a technique called "joins".
+ > ** Note:** In more advanced data wrangling,
+ > one might solve this problem in a less error-prone way through using
+ > a technique called "joins".
> We link to resources that discuss this in the additional
> resources at the end of this chapter.
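
A related, order-independent safeguard (a sketch we add here, not the chapter's
own code) is to look up each row's population by its region name with ` .map ` ,
using the same population figures stored in ` city_pops ` above:

``` {code-cell} ipython3
pops = {
    "Toronto": 5928040,
    "Montréal": 4098927,
    "Vancouver": 2463431,
    "Calgary": 1392609,
    "Edmonton": 1321426,
}
# divide each row's count by the population of that row's region
english_langs["most_at_home"] / english_langs["region"].map(pops)
```
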
@@ -1369,21 +1235,21 @@ Failing to do this would have resulted in the incorrect math being performed.

<!--
#### Creating a visualization with tidy data {-}

- Now that we have cleaned and wrangled the data, we can make visualizations or do
+ Now that we have cleaned and wrangled the data, we can make visualizations or do
statistical analyses to answer questions about it! Let's suppose we want to
- answer the question "what proportion of people in each city speak English
+ answer the question "what proportion of people in each city speak English
as their primary language at home in these five cities?" Since the data is
cleaned already, in a few short lines of code, we can use `ggplot` to create a
data visualization to answer this question! Here we create a bar plot to represent the proportions for
each region and color the proportions by language.

- > Don't worry too much about the code to make this plot for now. We will cover
+ > Don't worry too much about the code to make this plot for now. We will cover
> visualizations in detail in Chapter \@ref(viz).

```{r 02-plot, out.width = "100%", fig.cap = "Bar plot of proportions of Canadians reporting English as the most often spoken language at home."}
ggplot(english_langs,
    aes(
-         x = region,
+         x = region,
        y = most_at_home_proportion
    )
) +
@@ -1413,7 +1279,7 @@ frame called `data`:

2) filter for rows where another column, ` other_col ` , is more than 5, and
3) select only the new column ` new_col ` for those rows.

- One way of performing these three steps is to just write
+ One way of performing these three steps is to just write
multiple lines of code, storing temporary objects as you go:

``` {code-cell} ipython3
@@ -1450,7 +1316,7 @@ each subsequent line.

+++

Chaining the sequential functions solves this problem, resulting in cleaner and
- easier-to-follow code.
+ easier-to-follow code.
The code below accomplishes the same thing as the previous
two code blocks:

@@ -1468,8 +1334,8 @@ output = (
:tags: [remove-cell]

# ``` {r eval = F}
- # output <- select(filter(mutate(data, new_col = old_col * 2),
- #                  other_col > 5),
+ # output <- select(filter(mutate(data, new_col = old_col * 2),
+ #                  other_col > 5),
#           new_col)
# ```
# Code like this can also be difficult to understand. Functions compose (reading
@@ -1479,10 +1345,10 @@ output = (
# The *pipe operator* (`|>`) solves this problem, resulting in cleaner and
# easier-to-follow code. `|>` is built into R so you don't need to load any
- # packages to use it.
+ # packages to use it.
# You can think of the pipe as a physical pipe. It takes the output from the
# function on the left-hand side of the pipe, and passes it as the first argument
- # to the function on the right-hand side of the pipe.
+ # to the function on the right-hand side of the pipe.
# The code below accomplishes the same thing as the previous
# two code blocks:
```
@@ -1491,7 +1357,7 @@ output = (

> lines, similar to when we did this earlier in the chapter
> for long function calls. Again, this is allowed and recommended, especially when
> the chained function calls create a long line of code. Doing this makes
- > your code more readable. When you do this, it is important to use parentheses
+ > your code more readable. When you do this, it is important to use parentheses
> to tell Python that your code is continuing onto the next line.

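For instance, a chained pipeline wrapped in parentheses might look like the
sketch below (our added illustration of the convention, reusing the
hypothetical ` data ` frame from above; ` .query ` is a standard ` pandas `
method we use here for brevity):

``` {code-cell} ipython3
output = (
    data.assign(new_col=data["old_col"] * 2)  # create the new column
    .query("other_col > 5")                   # keep rows where other_col > 5
    .loc[:, ["new_col"]]                      # select only new_col
)
output
```
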
``` {code-cell} ipython3
@@ -1507,28 +1373,28 @@ output = (

# > **Note:** In this textbook, we will be using the base R pipe operator syntax, `|>`.
# > This base R `|>` pipe operator was inspired by a previous version of the pipe
- # > operator, `%>%`. The `%>%` pipe operator is not built into R
- # > and is from the `magrittr` R package.
- # > The `tidyverse` metapackage imports the `%>%` pipe operator via `dplyr`
+ # > operator, `%>%`. The `%>%` pipe operator is not built into R
+ # > and is from the `magrittr` R package.
+ # > The `tidyverse` metapackage imports the `%>%` pipe operator via `dplyr`
# > (which in turn imports the `magrittr` R package).
- # > There are some other differences between `%>%` and `|>` related to
- # > more advanced R uses, such as sharing and distributing code as R packages,
- # > however, these are beyond the scope of this textbook.
+ # > There are some other differences between `%>%` and `|>` related to
+ # > more advanced R uses, such as sharing and distributing code as R packages,
+ # > however, these are beyond the scope of this textbook.
# > We have this note in the book to make the reader aware that `%>%` exists
- # > as it is still commonly used in data analysis code and in many data science
+ # > as it is still commonly used in data analysis code and in many data science
# > books and other resources.
# > In most cases these two pipes are interchangeable and either can be used.

# \index{pipe}\index{aaapipesymbb@\%>\%|see{pipe}}
```

- ### Chaining ` df []` and ` .loc `
+ ### Chaining ` [] ` and ` .loc `

+++

- Let's work with the tidy ` tidy_lang ` data set from Section {ref}` str-split ` ,
- which contains the number of Canadians reporting their primary language at home
- and work for five major cities
+ Let's work with the tidy ` tidy_lang ` data set from Section {ref}` str-split ` ,
+ which contains the number of Canadians reporting their primary language at home
+ and at work for five major cities
(Toronto, Montréal, Vancouver, Calgary, and Edmonton):

``` {code-cell} ipython3
@@ -1537,7 +1403,7 @@ tidy_lang

Suppose we want to create a subset of the data with only the languages and
counts of each language spoken most at home for the city of Vancouver. To do
- this, we can use the ` df []` and ` .loc ` . First, we use ` df []` to
+ this, we can use ` [] ` and ` .loc ` . First, we use ` [] ` to
create a data frame called ` van_data ` that contains only values for Vancouver.

``` {code-cell} ipython3
@@ -1554,8 +1420,8 @@ van_data_selected

Although this is valid code, there is a more readable approach we could take by
chaining the operations. With chaining, we do not need to create an intermediate
- object to store the output from ` df []` . Instead, we can directly call ` .loc ` upon the
- output of ` df []` :
+ object to store the output from ` [] ` . Instead, we can directly call ` .loc ` upon the
+ output of ` [] ` :

``` {code-cell} ipython3
van_data_selected = tidy_lang[tidy_lang["region"] == "Vancouver"].loc[
@@ -1568,12 +1434,12 @@ van_data_selected

``` {code-cell} ipython3
:tags: [remove-cell]

- # But wait...Why do the `select` and `filter` function calls
- # look different in these two examples?
- # Remember: when you use the pipe,
- # the output of the first function is automatically provided
- # as the first argument for the function that comes after it.
- # Therefore you do not specify the first argument in that function call.
+ # But wait...Why do the `select` and `filter` function calls
+ # look different in these two examples?
+ # Remember: when you use the pipe,
+ # the output of the first function is automatically provided
+ # as the first argument for the function that comes after it.
+ # Therefore you do not specify the first argument in that function call.
# In the code above,
# the first line is just the `tidy_lang` data frame with a pipe.
# The pipe passes the left-hand side (`tidy_lang`) to the first argument of the function on the right (`filter`),
@@ -1591,21 +1457,21 @@ approach is clearer and more readable.

+++

- Chaining can be used with any method in Python.
- Additionally, we can chain together more than two functions.
- For example, we can chain together three functions to:
+ Chaining can be used with any method in Python.
+ Additionally, we can chain together more than two functions.
+ For example, we can chain together three functions to:

- - extract rows (` df []` ) to include only those where the counts of the language most spoken at home are greater than 10,000,
+ - extract rows (` [] ` ) to include only those where the counts of the language most spoken at home are greater than 10,000,
- extract only the columns (` .loc ` ) corresponding to ` region ` , ` language ` and ` most_at_home ` , and
- - sort the data frame rows in order (` .sort_values ` ) by counts of the language most spoken at home
+ - sort the data frame rows in order (` .sort_values ` ) by counts of the language most spoken at home
from smallest to largest.

``` {index} pandas.DataFrame; sort_values
```

- As we saw in Chapter {ref}` intro ` ,
- we can use the ` .sort_values ` function
- to order the rows in the data frame by the values of one or more columns.
+ As we saw in Chapter {ref}` intro ` ,
+ we can use the ` .sort_values ` function
+ to order the rows in the data frame by the values of one or more columns.
Here we pass the column name ` most_at_home ` to sort the data frame rows by the values in that column, in ascending order.

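Putting the three steps together, the chain might look like the following
sketch (our added illustration of the description above; the chapter's own
cell may differ slightly):

``` {code-cell} ipython3
large_region_lang = (
    tidy_lang[tidy_lang["most_at_home"] > 10000]
    .loc[:, ["region", "language", "most_at_home"]]
    .sort_values(by="most_at_home")
)
large_region_lang
```
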
``` {code-cell} ipython3
@@ -1626,7 +1492,7 @@ large_region_lang
# using it as the first argument of the first function. These two choices are equivalent,
# and we get the same result.
# ``` {r}
- # large_region_lang <- tidy_lang |>
+ # large_region_lang <- tidy_lang |>
#   filter(most_at_home > 10000) |>
#   select(region, language, most_at_home) |>
#   arrange(most_at_home)
@@ -1636,12 +1502,12 @@ large_region_lang
 ```

Now that we've shown you chaining as an alternative to storing
- temporary objects and composing code, does this mean you should * never* store
- temporary objects or compose code? Not necessarily!
- There are times when you will still want to do these things.
- For example, you might store a temporary object before feeding it into a plot function
+ temporary objects and composing code, does this mean you should * never* store
+ temporary objects or compose code? Not necessarily!
+ There are times when you will still want to do these things.
+ For example, you might store a temporary object before feeding it into a plot function
so you can iteratively change the plot without having to
- redo all of your data transformations.
+ redo all of your data transformations.
Additionally, chaining many functions can be overwhelming and difficult to debug;
you may want to store a temporary object midway through to inspect your result
before moving on with further steps.
@@ -1658,12 +1524,12 @@ before moving on with further steps.

 ```

As a part of many data analyses, we need to calculate a summary value for the
- data (a * summary statistic* ).
- Examples of summary statistics we might want to calculate
- are the number of observations, the average/mean value for a column,
- the minimum value, etc.
- Oftentimes,
- this summary statistic is calculated from the values in a data frame column,
+ data (a * summary statistic* ).
+ Examples of summary statistics we might want to calculate
+ are the number of observations, the average/mean value for a column,
+ the minimum value, etc.
+ Oftentimes,
+ this summary statistic is calculated from the values in a data frame column,
or columns, as shown in {numref}` fig:summarize ` .

+++ {"tags": [ ] }
@@ -1684,11 +1550,11 @@ First a reminder of what `region_lang` looks like:

``` {code-cell} ipython3
:tags: [remove-cell]

- # A useful `dplyr` function for calculating summary statistics is `summarize`,
+ # A useful `dplyr` function for calculating summary statistics is `summarize`,
# where the first argument is the data frame and subsequent arguments
- # are the summaries we want to perform.
- # Here we show how to use the `summarize` function to calculate the minimum
- # and maximum number of Canadians
+ # are the summaries we want to perform.
+ # Here we show how to use the `summarize` function to calculate the minimum
+ # and maximum number of Canadians
# reporting a particular language as their primary language at home.
# First a reminder of what `region_lang` looks like:
 ```
@@ -1698,9 +1564,9 @@ region_lang = pd.read_csv("data/region_lang.csv")
region_lang
 ```

- We apply ` min ` to calculate the minimum
- and ` max ` to calculate maximum number of Canadians
- reporting a particular language as their primary language at home,
+ We apply ` min ` to calculate the minimum
+ and ` max ` to calculate the maximum number of Canadians
+ reporting a particular language as their primary language at home,
for any region, and ` .assign ` a column name to each:

``` {code-cell} ipython3
@@ -1744,33 +1610,33 @@ people.

``` {index} see: NaN; missing data
```

- In ` pandas ` DataFrame, the value ` NaN ` is often used to denote missing data.
- Many of the base python statistical summary functions
- (e.g., ` max ` , ` min ` , ` sum ` , etc) will return ` NaN `
- when applied to columns containing ` NaN ` values.
- Usually that is not what we want to happen;
+ In a ` pandas ` data frame, the value ` NaN ` is often used to denote missing data.
+ Many of the base Python statistical summary functions
+ (e.g., ` max ` , ` min ` , ` sum ` , etc.) will return ` NaN `
+ when applied to columns containing ` NaN ` values.
+ Usually that is not what we want to happen;
instead, we would usually like Python to ignore the missing entries
and calculate the summary statistic using all of the other non-` NaN ` values
in the column.
- Fortunately ` pandas ` provides many equivalent methods (e.g., ` .max ` , ` .min ` , ` .sum ` , etc) to
+ Fortunately ` pandas ` provides many equivalent methods (e.g., ` .max ` , ` .min ` , ` .sum ` , etc.) to
these summary functions while providing an extra argument ` skipna ` that lets
us tell the function what to do when it encounters ` NaN ` values.
In particular, if we specify ` skipna=True ` (default), the function will ignore
missing values and return a summary of all the non-missing entries.
We show an example of this below.

First we create a new version of the ` region_lang ` data frame,
- named ` region_lang_na ` , that has a seemingly innocuous ` NaN `
+ named ` region_lang_na ` , that has a seemingly innocuous ` NaN `
in the first row of the ` most_at_home ` column:

``` {code-cell} ipython3
:tags: [remove-cell]

- # In data frames in R, the value `NA` is often used to denote missing data.
- # Many of the base R statistical summary functions
- # (e.g., `max`, `min`, `mean`, `sum`, etc) will return `NA`
+ # In data frames in R, the value `NA` is often used to denote missing data.
+ # Many of the base R statistical summary functions
+ # (e.g., `max`, `min`, `mean`, `sum`, etc) will return `NA`
# when applied to columns containing `NA` values. \index{missing data}\index{NA|see{missing data}}
- # Usually that is not what we want to happen;
+ # Usually that is not what we want to happen;
# instead, we would usually like R to ignore the missing entries
# and calculate the summary statistic using all of the other non-`NA` values
# in the column.
@@ -1792,8 +1658,8 @@ region_lang_na.loc[0, "most_at_home"] = np.nan
region_lang_na
 ```

- Now if we apply the Python built-in summary function as above,
- we see that we no longer get the minimum and maximum returned,
+ Now if we apply the Python built-in summary functions as above,
+ we see that we no longer get the minimum and maximum returned,
but just an ` NaN ` instead!

``` {code-cell} ipython3
@@ -1827,21 +1693,21 @@ lang_summary_na

``` {index} pandas.DataFrame; groupby
```

- A common pairing with summary functions is ` .groupby ` . Pairing these functions
+ A common pairing with summary functions is ` .groupby ` . Pairing these functions
together can let you summarize values for subgroups within a data set,
- as illustrated in {numref}` fig:summarize-groupby ` .
- For example, we can use ` .groupby ` to group the regions of the ` tidy_lang ` data frame and then calculate the minimum and maximum number of Canadians
- reporting the language as the primary language at home
+ as illustrated in {numref}` fig:summarize-groupby ` .
+ For example, we can use ` .groupby ` to group the regions of the ` region_lang ` data frame and then calculate the minimum and maximum number of Canadians
+ reporting the language as the primary language at home
for each of the regions in the data set.

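In code, that grouped summary might look like the sketch below (our added
illustration; the chapter's own cell may differ):

``` {code-cell} ipython3
region_lang.groupby("region")["most_at_home"].agg(["min", "max"])
```
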
``` {code-cell} ipython3
:tags: [remove-cell]

# A common pairing with `summarize` is `group_by`. Pairing these functions \index{group\_by}
# together can let you summarize values for subgroups within a data set,
- # as illustrated in Figure \@ref(fig:summarize-groupby).
- # For example, we can use `group_by` to group the regions of the `tidy_lang` data frame and then calculate the minimum and maximum number of Canadians
- # reporting the language as the primary language at home
+ # as illustrated in Figure \@ref(fig:summarize-groupby).
+ # For example, we can use `group_by` to group the regions of the `tidy_lang` data frame and then calculate the minimum and maximum number of Canadians
+ # reporting the language as the primary language at home
# for each of the regions in the data set.

# (ref:summarize-groupby) `summarize` and `group_by` is useful for calculating summary statistics on one or more column(s) for each group. It creates a new data frame—with one row for each group—containing the summary statistic(s) for each column being summarized. It also creates a column listing the value of the grouping variable. The darker, top row of each table represents the column headers. The gray, blue, and green colored rows correspond to the rows that belong to each of the three groups being represented in this cartoon example.
@@ -1888,11 +1754,11 @@ Notice that `.groupby` converts a `DataFrame` object to a `DataFrameGroupBy` obj

``` {code-cell} ipython3
:tags: [remove-cell]

- # Notice that `group_by` on its own doesn't change the way the data looks.
- # In the output below, the grouped data set looks the same,
- # and it doesn't *appear* to be grouped by `region`.
- # Instead, `group_by` simply changes how other functions work with the data,
- # as we saw with `summarize` above.
+ # Notice that `group_by` on its own doesn't change the way the data looks.
+ # In the output below, the grouped data set looks the same,
+ # and it doesn't *appear* to be grouped by `region`.
+ # Instead, `group_by` simply changes how other functions work with the data,
+ # as we saw with `summarize` above.
 ```

``` {code-cell} ipython3
@@ -1905,23 +1771,23 @@ region_lang.groupby("region")

Sometimes we need to summarize statistics across many columns.
An example of this is illustrated in {numref}` fig:summarize-across ` .
- In such a case, using summary functions alone means that we have to
+ In such a case, using summary functions alone means that we have to
type out the name of each column we want to summarize.
- In this section we will meet two strategies for performing this task.
+ In this section we will meet two strategies for performing this task.
First we will see how we can do this using ` .iloc[] ` to slice the columns before applying summary functions.
- Then we will also explore how we can use a more general iteration function,
+ Then we will also explore how we can use a more general iteration function,
` .apply ` , to also accomplish this.

``` {code-cell} ipython3
:tags: [remove-cell]

# Sometimes we need to summarize statistics across many columns.
# An example of this is illustrated in Figure \@ref(fig:summarize-across).
- # In such a case, using `summarize` alone means that we have to
+ # In such a case, using `summarize` alone means that we have to
# type out the name of each column we want to summarize.
- # In this section we will meet two strategies for performing this task.
+ # In this section we will meet two strategies for performing this task.
# First we will see how we can do this using `summarize` + `across`.
- # Then we will also explore how we can use a more general iteration function,
+ # Then we will also explore how we can use a more general iteration function,
# `map`, to also accomplish this.
 ```

@@ -1943,9 +1809,9 @@ Then we will also explore how we can use a more general iteration function,

``` {index} column range
```

- Recall that in the Section {ref}` loc-iloc ` , we can use ` .iloc[] ` to extract a range of columns with indices. Here we demonstrate finding the maximum value
+ Recall from Section {ref}` loc-iloc ` that we can use ` .iloc[] ` to extract a range of columns with indices. Here we demonstrate finding the maximum value
of each of the numeric
- columns of the ` region_lang ` data set through pairing ` .iloc[] ` and ` .max ` . This means that the
+ columns of the ` region_lang ` data set through pairing ` .iloc[] ` and ` .max ` . This means that the
summary methods (* e.g.* ` .min ` , ` .max ` , ` .sum ` , etc.) can be used for data frames as well.

``` {code-cell} ipython3
@@ -1958,35 +1824,35 @@ jupyter:
source_hidden: true
tags: [remove-cell]
---
- # To summarize statistics across many columns, we can use the
+ # To summarize statistics across many columns, we can use the
# `summarize` function we have just recently learned about.
- # However, in such a case, using `summarize` alone means that we have to
- # type out the name of each column we want to summarize.
+ # However, in such a case, using `summarize` alone means that we have to
+ # type out the name of each column we want to summarize.
# To do this more efficiently, we can pair `summarize` with `across` \index{across}
# and use a colon `:` to specify a range of columns we would like \index{column range}
# to perform the statistical summaries on.
- # Here we demonstrate finding the maximum value
+ # Here we demonstrate finding the maximum value
# of each of the numeric
# columns of the `region_lang` data set.

# ``` {r 02-across-data}
# region_lang |>
#   summarize(across(mother_tongue:lang_known, max))
- # ```
+ # ```

- # > **Note:** Similar to when we use base R statistical summary functions
- # > (e.g., `max`, `min`, `mean`, `sum`, etc) with `summarize` alone,
- # > the use of the `summarize` + `across` functions paired
+ # > **Note:** Similar to when we use base R statistical summary functions
+ # > (e.g., `max`, `min`, `mean`, `sum`, etc) with `summarize` alone,
+ # > the use of the `summarize` + `across` functions paired
# > with base R statistical summary functions
- # > also return `NA`s when we apply them to columns that
+ # > also return `NA`s when we apply them to columns that
# > contain `NA`s in the data frame. \index{missing data}
- # >
+ # >
# > To avoid this, again we need to add the argument `na.rm = TRUE`,
# > but in this case we need to use it a little bit differently.
# > In this case, we need to add a `,` and then `na.rm = TRUE`,
- # > after specifying the function we want `summarize` + `across` to apply,
+ # > after specifying the function we want `summarize` + `across` to apply,
# > as illustrated below:
- # >
+ # >
# > ``` {r}
# > region_lang_na |>
# >   summarize(across(mother_tongue:lang_known, max, na.rm = TRUE))
@@ -2005,9 +1871,9 @@ An alternative to aggregating on a dataframe
for applying a function to many columns is the ` .apply ` method.
Let's again find the maximum value of each column of the
` region_lang ` data frame, but using ` .apply ` with the ` max ` function this time.
- We focus on the two arguments of ` .apply ` :
+ We focus on the two arguments of ` .apply ` :
the function that you would like to apply to each column, and the ` axis ` along which the function will be applied (` 0 ` for columns, ` 1 ` for rows).
- Note that ` .apply ` does not have an argument
+ Note that ` .apply ` does not have an argument
to specify * which* columns to apply the function to.
Therefore, we will use ` .iloc[] ` before calling ` .apply `
to choose the columns for which we want the maximum.
@@ -2018,14 +1884,14 @@ jupyter:
source_hidden: true
tags: [remove-cell]
---
- # An alternative to `summarize` and `across`
+ # An alternative to `summarize` and `across`
# for applying a function to many columns is the `map` family of functions. \index{map}
# Let's again find the maximum value of each column of the
# `region_lang` data frame, but using `map` with the `max` function this time.
- # `map` takes two arguments:
- # an object (a vector, data frame or list) that you want to apply the function to,
+ # `map` takes two arguments:
+ # an object (a vector, data frame or list) that you want to apply the function to,
# and the function that you would like to apply to each column.
- # Note that `map` does not have an argument
+ # Note that `map` does not have an argument
# to specify *which* columns to apply the function to.
# Therefore, we will use the `select` function before calling `map`
# to choose the columns for which we want the maximum.
@@ -2038,15 +1904,15 @@ pd.DataFrame(region_lang.iloc[:, 3:].apply(max, axis=0)).T

``` {index} missing data
```

- > ** Note:** Similar to when we use base Python statistical summary functions
- > (e.g., ` max ` , ` min ` , ` sum ` , etc.) when there are ` NaN ` s,
+ > ** Note:** Similar to when we use base Python statistical summary functions
+ > (e.g., ` max ` , ` min ` , ` sum ` , etc.) when there are ` NaN ` s,
> ` .apply ` functions paired with base Python statistical summary functions
- > also return ` NaN ` values when we apply them to columns that
- > contain ` NaN ` values.
- >
+ > also return ` NaN ` values when we apply them to columns that
+ > contain ` NaN ` values.
+ >
> To avoid this, again we need to use the ` pandas ` variants of summary functions (* i.e.*
> ` .max ` , ` .min ` , ` .sum ` , etc.) with ` skipna=True ` .
- > When we use this with ` .apply ` , we do this by constructing a anonymous function that calls
+ > When we use this with ` .apply ` , we do this by constructing an anonymous function that calls
> the ` .max ` method with ` skipna=True ` , as illustrated below:

``` {code-cell} ipython3
@@ -2055,17 +1921,17 @@ pd.DataFrame(
).T
 ```

- The ` .apply ` function is generally quite useful for solving many problems
- involving repeatedly applying functions in Python.
- Additionally, a variant of ` .apply ` is ` .applymap ` ,
+ The ` .apply ` function is generally quite useful for solving many problems
+ involving repeatedly applying functions in Python.
+ Additionally, a variant of ` .apply ` is ` .applymap ` ,
which can be used to apply functions element-wise.
To learn more about these functions, see the additional resources
section at the end of this chapter.

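For instance, an element-wise transformation with ` .applymap ` might look like
the brief sketch below (our added illustration, not code from the chapter):

``` {code-cell} ipython3
# divide every count in the numeric columns by 1000, element-wise
region_lang.iloc[:, 3:].applymap(lambda x: x / 1000)
```
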
+++ {"jp-MarkdownHeadingCollapsed": true, "tags": [ "remove-cell"] }

<!-- > **Note:** The `map` function comes from the `purrr` package. But since
- > `purrr` is part of the tidyverse, once we call `library(tidyverse)` we
+ > `purrr` is part of the tidyverse, once we call `library(tidyverse)` we
> do not need to load the `purrr` package separately.

The output looks a bit weird... we passed in a data frame, but the output
@@ -2080,7 +1946,7 @@ region_lang |>
 ```

So what do we do? Should we convert this to a data frame? We could, but a
- simpler alternative is to just use a different `map` function. There
+ simpler alternative is to just use a different `map` function. There
are quite a few to choose from, they all work similarly, but
their name reflects the type of output you want from the mapping operation.
Table \@ref(tab:map-table) lists the commonly used `map` functions as well
@@ -2107,24 +1973,24 @@ region_lang |>
    map_dfr(max)
 ```

- > **Note:** Similar to when we use base R statistical summary functions
- > (e.g., `max`, `min`, `mean`, `sum`, etc.) with `summarize`,
+ > **Note:** Similar to when we use base R statistical summary functions
+ > (e.g., `max`, `min`, `mean`, `sum`, etc.) with `summarize`,
> `map` functions paired with base R statistical summary functions
- > also return `NA` values when we apply them to columns that
+ > also return `NA` values when we apply them to columns that
> contain `NA` values. \index{missing data}
- >
+ >
> To avoid this, again we need to add the argument `na.rm = TRUE`.
- > When we use this with `map`, we do this by adding a `,`
+ > When we use this with `map`, we do this by adding a `,`
> and then `na.rm = TRUE` after specifying the function, as illustrated below:
- >
+ >
> ``` {r}
> region_lang_na |>
>   select(mother_tongue:lang_known) |>
>   map_dfr(max, na.rm = TRUE)
> ```

- The `map` functions are generally quite useful for solving many problems
- involving repeatedly applying functions in R.
+ The `map` functions are generally quite useful for solving many problems
+ involving repeatedly applying functions in R.
Additionally, their use is not limited to columns of a data frame;
`map` family functions can be used to apply functions to elements of a vector,
or a list, and even to lists of (nested!) data frames.
@@ -2135,8 +2001,8 @@ section at the end of this chapter. -->

## Apply functions across many columns with ` .apply `

- Sometimes we need to apply a function to many columns in a data frame.
- For example, we would need to do this when converting units of measurements across many columns.
+ Sometimes we need to apply a function to many columns in a data frame.
+ For example, we would need to do this when converting units of measurement across many columns.
We illustrate such a data transformation in {numref}` fig:mutate-across ` .

+++ {"tags": [ ] }
@@ -2150,11 +2016,11 @@ We illustrate such a data transformation in {numref}`fig:mutate-across`.

+++

- For example,
- imagine that we wanted to convert all the numeric columns
- in the ` region_lang ` data frame from ` int64 ` type to ` int32 ` type
+ For example,
+ imagine that we wanted to convert all the numeric columns
+ in the ` region_lang ` data frame from ` int64 ` type to ` int32 ` type
using the ` .astype ` method.
- When we revisit the ` region_lang ` data frame,
+ When we revisit the ` region_lang ` data frame,
we can see that this would be the columns from ` mother_tongue ` to ` lang_known ` .

``` {code-cell} ipython3
@@ -2163,11 +2029,11 @@ jupyter:
source_hidden: true
tags: [remove-cell]
---
- # For example,
- # imagine that we wanted to convert all the numeric columns
- # in the `region_lang` data frame from double type to integer type
+ # For example,
+ # imagine that we wanted to convert all the numeric columns
+ # in the `region_lang` data frame from double type to integer type
# using the `as.integer` function.
- # When we revisit the `region_lang` data frame,
+ # When we revisit the `region_lang` data frame,
# we can see that this would be the columns from `mother_tongue` to `lang_known`.
 ```

@@ -2179,12 +2045,12 @@ region_lang
 ```

To accomplish such a task, we can use ` .apply ` .
- This works in a similar way for column selection,
+ This works in a similar way for column selection,
as we saw when we used it in Section {ref}` apply-summary ` earlier.
- As we did above,
+ As we did above,
we again use ` .iloc ` to specify the columns
as well as ` .apply ` to specify the function we want to apply on these columns.
- However, a key difference here is that we are not using aggregating function here,
+ However, a key difference here is that we are not using an aggregating function,
which means that we get back a data frame with the same number of rows.

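As a sketch of what this looks like (our added illustration, simplified to the
numeric columns only; the chapter's own cell may keep all columns), the
conversion could be written as:

``` {code-cell} ipython3
# convert each numeric column to int32; axis=0 applies the function per column
region_lang.iloc[:, 3:].apply(lambda col: col.astype("int32"), axis=0)
```
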
``` {code-cell} ipython3
@@ -2194,17 +2060,17 @@ jupyter:
tags: [remove-cell]
---
# To accomplish such a task, we can use `mutate` paired with `across`. \index{across}
- # This works in a similar way for column selection,
+ # This works in a similar way for column selection,
# as we saw when we used `summarize` + `across` earlier.
- # As we did above,
+ # As we did above,
# we again use `across` to specify the columns using `select` syntax
# as well as the function we want to apply on the specified columns.
- # However, a key difference here is that we are using `mutate`,
+ # However, a key difference here is that we are using `mutate`,
# which means that we get back a data frame with the same number of rows.
 ```

``` {code-cell} ipython3
- region_lang.dtypes
+ region_lang.info()
 ```

``` {code-cell} ipython3
@@ -2214,19 +2080,19 @@ region_lang_int32
region_lang_int32
 ```

``` {code-cell} ipython3
- region_lang_int32.dtypes
+ region_lang_int32.info()
 ```

2221
2087
with the same number of columns and rows.
2222
- The only thing that changes is the transformation we applied
2088
+ The only thing that changes is the transformation we applied
2223
2089
to the specified columns (here ` mother_tongue ` to ` lang_known ` ).
2224
2090
2225
2091
+++
2226
2092
2227
2093
## Apply functions across columns within one row with ` .apply `
2228
2094
2229
- What if you want to apply a function across columns but within one row?
2095
+ What if you want to apply a function across columns but within one row?
2230
2096
We illustrate such a data transformation in {numref}` fig:rowwise ` .
2231
2097
2232
2098
+++ {"tags": [ ] }
@@ -2241,12 +2107,12 @@ We illustrate such a data transformation in {numref}`fig:rowwise`.
2241
2107
+++
2242
2108
2243
2109
For instance, suppose we want to know the maximum value between ` mother_tongue ` ,
2244
- ` most_at_home ` , ` most_at_work `
2110
+ ` most_at_home ` , ` most_at_work `
2245
2111
and ` lang_known ` for each language and region
2246
2112
in the ` region_lang ` data set.
2247
2113
In other words, we want to apply the ` max ` function * row-wise.*
2248
2114
Before we use ` .apply ` , we will again use ` .iloc ` to select only the count columns
2249
- so we can see all the columns in the data frame's output easily in the book.
2115
+ so we can see all the columns in the data frame's output easily in the book.
2250
2116
So for this demonstration, the data set we are operating on looks like this:
2251
2117
2252
2118
``` {code-cell} ipython3
@@ -2256,15 +2122,15 @@ jupyter:
2256
2122
tags: [remove-cell]
2257
2123
---
2258
2124
# For instance, suppose we want to know the maximum value between `mother_tongue`,
2259
- # `most_at_home`, `most_at_work`
2125
+ # `most_at_home`, `most_at_work`
2260
2126
# and `lang_known` for each language and region
2261
2127
# in the `region_lang` data set.
2262
2128
# In other words, we want to apply the `max` function *row-wise.*
2263
- # We will use the (aptly named) `rowwise` function in combination with `mutate`
2264
- # to accomplish this task.
2129
+ # We will use the (aptly named) `rowwise` function in combination with `mutate`
2130
+ # to accomplish this task.
2265
2131
2266
2132
# Before we apply `rowwise`, we will `select` only the count columns \index{rowwise}
2267
- # so we can see all the columns in the data frame's output easily in the book.
2133
+ # so we can see all the columns in the data frame's output easily in the book.
2268
2134
# So for this demonstration, the data set we are operating on looks like this:
2269
2135
```
2270
2136
@@ -2274,7 +2140,7 @@ region_lang.iloc[:, 3:]
2274
2140
2275
2141
Now we use ` .apply ` with argument ` axis=1 ` , to tell Python that we would like
2276
2142
the ` max ` function to be applied across, and within, a row,
2277
- as opposed to being applied on a column
2143
+ as opposed to being applied on a column
2278
2144
(which is the default behavior of ` .apply ` ):
2279
2145
2280
2146
``` {code-cell} ipython3
@@ -2285,7 +2151,7 @@ tags: [remove-cell]
2285
2151
---
2286
2152
# Now we apply `rowwise` before `mutate`, to tell R that we would like
2287
2153
# the mutate function to be applied across, and within, a row,
2288
- # as opposed to being applied on a column
2154
+ # as opposed to being applied on a column
2289
2155
# (which is the default behavior of `mutate`):
2290
2156
```
2291
2157
@@ -2297,7 +2163,7 @@ region_lang_rowwise = region_lang.assign(
2297
2163
region_lang_rowwise
2298
2164
```
2299
2165
2300
- We see that we get an additional column added to the data frame,
2166
+ We see that we get an additional column added to the data frame,
2301
2167
named ` maximum ` , which is the maximum value between ` mother_tongue ` ,
2302
2168
` most_at_home ` , ` most_at_work ` and ` lang_known ` for each language
2303
2169
and region.
@@ -2308,52 +2174,52 @@ jupyter:
2308
2174
source_hidden: true
tags: [remove-cell]
---
- # Similar to `group_by`,
- # `rowwise` doesn't appear to do anything when it is called by itself.
- # However, we can apply `rowwise` in combination
+ # Similar to `group_by`,
+ # `rowwise` doesn't appear to do anything when it is called by itself.
+ # However, we can apply `rowwise` in combination
# with other functions to change how these other functions operate on the data.
- # Notice if we used `mutate` without `rowwise`,
- # we would have computed the maximum value across *all* rows
- # rather than the maximum value for *each* row.
+ # Notice if we used `mutate` without `rowwise`,
+ # we would have computed the maximum value across *all* rows
+ # rather than the maximum value for *each* row.
# Below we show what would have happened had we not used
- # `rowwise`. In particular, the same maximum value is reported
+ # `rowwise`. In particular, the same maximum value is reported
# in every single row; this code does not provide the desired result.

# ```{r}
- # region_lang |>
+ # region_lang |>
#   select(mother_tongue:lang_known) |>
- #   mutate(maximum = max(c(mother_tongue,
- #                          most_at_home,
- #                          most_at_home,
+ #   mutate(maximum = max(c(mother_tongue,
+ #                          most_at_home,
+ #                          most_at_home,
#     lang_known)))
# ```
 ```

## Summary

- Cleaning and wrangling data can be a very time-consuming process. However,
+ Cleaning and wrangling data can be a very time-consuming process. However,
it is a critical step in any data analysis. We have explored many different
- functions for cleaning and wrangling data into a tidy format.
- {numref}` tab:summary-functions-table ` summarizes some of the key wrangling
- functions we learned in this chapter. In the following chapters, you will
- learn how you can take this tidy data and do so much more with it to answer your
+ functions for cleaning and wrangling data into a tidy format.
+ {numref}` tab:summary-functions-table ` summarizes some of the key wrangling
+ functions we learned in this chapter. In the following chapters, you will
+ learn how you can take this tidy data and do so much more with it to answer your
burning data science questions!

+++

- ``` {table} Summary of wrangling functions
+ ``` {table} Summary of wrangling functions
:name: tab:summary-functions-table

| Function | Description |
- | --- | ----------- |
+ | --- | ----------- |
| `.agg` | calculates aggregated summaries of inputs |
- | `.apply` | allows you to apply function(s) to multiple columns/rows |
- | `.assign` | adds or modifies columns in a data frame |
+ | `.apply` | allows you to apply function(s) to multiple columns/rows |
+ | `.assign` | adds or modifies columns in a data frame |
| `.groupby` | allows you to apply function(s) to groups of rows |
| `.iloc` | subsets columns/rows of a data frame using integer indices |
- | `.loc` | subsets columns/rows of a data frame using labels |
+ | `.loc` | subsets columns/rows of a data frame using labels |
| `.melt` | generally makes the data frame longer and narrower |
- | `.pivot` | generally makes a data frame wider and decreases the number of rows |
+ | `.pivot` | generally makes a data frame wider and decreases the number of rows |
| `.str.split` | splits up a string column into multiple columns |
 ```

@@ -2365,37 +2231,37 @@ tags: [remove-cell]

---
# ## Summary

- # Cleaning and wrangling data can be a very time-consuming process. However,
+ # Cleaning and wrangling data can be a very time-consuming process. However,
# it is a critical step in any data analysis. We have explored many different
- # functions for cleaning and wrangling data into a tidy format.
- # Table \@ref(tab:summary-functions-table) summarizes some of the key wrangling
- # functions we learned in this chapter. In the following chapters, you will
- # learn how you can take this tidy data and do so much more with it to answer your
+ # functions for cleaning and wrangling data into a tidy format.
+ # Table \@ref(tab:summary-functions-table) summarizes some of the key wrangling
+ # functions we learned in this chapter. In the following chapters, you will
+ # learn how you can take this tidy data and do so much more with it to answer your
# burning data science questions!

# \newpage

- # Table: (#tab:summary-functions-table) Summary of wrangling functions
+ # Table: (#tab:summary-functions-table) Summary of wrangling functions

# | Function | Description |
- # | --- | ----------- |
- # | `across` | allows you to apply function(s) to multiple columns |
- # | `filter` | subsets rows of a data frame |
+ # | --- | ----------- |
+ # | `across` | allows you to apply function(s) to multiple columns |
+ # | `filter` | subsets rows of a data frame |
# | `group_by` | allows you to apply function(s) to groups of rows |
# | `mutate` | adds or modifies columns in a data frame |
# | `map` | general iteration function |
# | `pivot_longer` | generally makes the data frame longer and narrower |
- # | `pivot_wider` | generally makes a data frame wider and decreases the number of rows |
- # | `rowwise` | applies functions across columns within one row |
- # | `separate` | splits up a character column into multiple columns |
+ # | `pivot_wider` | generally makes a data frame wider and decreases the number of rows |
+ # | `rowwise` | applies functions across columns within one row |
+ # | `separate` | splits up a character column into multiple columns |
# | `select` | subsets columns of a data frame |
# | `summarize` | calculates summaries of inputs |
 ```

## Exercises

- Practice exercises for the material covered in this chapter
- can be found in the accompanying
+ Practice exercises for the material covered in this chapter
+ can be found in the accompanying
[ worksheets repository] ( https://github.com/UBC-DSCI/data-science-a-first-intro-worksheets#readme )
in the "Cleaning and wrangling data" row.
You can launch an interactive version of the worksheet in your browser by clicking the "launch binder" button.
@@ -2407,7 +2273,7 @@ and guidance that the worksheets provide will function as intended.

+++ {"tags": [ ] }

- ## Additional resources
+ ## Additional resources

- The [ ` pandas ` package documentation] ( https://pandas.pydata.org/docs/reference/index.html ) is
another resource to learn more about the functions in this
@@ -2433,14 +2299,14 @@ jupyter:
source_hidden: true
tags: [remove-cell]
---
- # ## Additional resources
+ # ## Additional resources

# - As we mentioned earlier, `tidyverse` is actually an *R
# meta package*: it installs and loads a collection of R packages that all
# follow the tidy data philosophy we discussed above. One of the `tidyverse`
# packages is `dplyr`—a data wrangling workhorse. You have already met many
- # of `dplyr`'s functions
- # (`select`, `filter`, `mutate`, `arrange`, `summarize`, and `group_by`).
+ # of `dplyr`'s functions
+ # (`select`, `filter`, `mutate`, `arrange`, `summarize`, and `group_by`).
# To learn more about these functions and meet a few more useful
# functions, we recommend you check out Chapters 5-9 of the [STAT545 online notes](https://stat545.com/).
# of the data wrangling, exploration, and analysis with R book.
@@ -2450,10 +2316,10 @@ tags: [remove-cell]
# The site also provides a very nice cheat sheet that summarizes many of the
# data wrangling functions from this chapter.
# - Check out the [`tidyselect` R package page](https://tidyselect.r-lib.org/index.html)
- # [@tidyselect] for a comprehensive list of `select` helpers.
- # These helpers can be used to choose columns in a data frame when paired with the `select` function
+ # [@tidyselect] for a comprehensive list of `select` helpers.
+ # These helpers can be used to choose columns in a data frame when paired with the `select` function
# (and other functions that use the `tidyselect` syntax, such as `pivot_longer`).
- # The [documentation for `select` helpers](https://tidyselect.r-lib.org/reference/select_helpers.html)
+ # The [documentation for `select` helpers](https://tidyselect.r-lib.org/reference/select_helpers.html)
# is a useful reference to find the helper you need for your particular problem.
# - *R for Data Science* [@wickham2016r] has a few chapters related to
# data wrangling that go into more depth than this book. For example, the
@@ -2476,4 +2342,4 @@ tags: [remove-cell]
``` {bibliography}
:filter: docname in docnames
- ```
+ ```
+ ```