Commit 51a6a7f ("wip on ch3"), committed Dec 23, 2022. Parent: 2e286fe.

1 file changed: source/wrangling.md (+530 additions, −664 deletions)

@@ -55,57 +55,19 @@ By the end of the chapter, readers will be able to do the following:
  - `.str.split`
- Recall and use the following operators for their
  intended data wrangling tasks:
  - `==`
  - `in`
  - `and`
  - `or`
  - `[]`
  - `.iloc[]`
  - `.loc[]`

## Data frames, series, and lists

In Chapters {ref}`intro` and {ref}`reading`, *data frames* were the focus:
we learned how to import data into Python as a data frame, and perform basic operations on data frames in Python.
In the remainder of this book, this pattern continues. The vast majority of tools we use will require
that data are represented as a `pandas` **data frame** in Python. Therefore, in this section,
we will dig more deeply into what data frames are and how they are represented in Python.
This knowledge will be helpful in effectively utilizing these objects in our data analyses.
@@ -152,45 +114,29 @@ data set. There are 13 entities in the data set in total, corresponding to the
A data frame storing data regarding the population of various regions in Canada. In this example data frame, the row that corresponds to the observation for the city of Vancouver is colored yellow, and the column that corresponds to the population variable is colored blue.
```

### What is a series?

```{index} pandas.Series
```

In Python, `pandas` **series** are ordered, labeled sequences of values.
They are strictly 1-dimensional and can contain any data type
(integers, strings, floats, etc.), including a mix of them; Python
has several different basic data types, as shown in
{numref}`tab:datatype-table`.
You can create a `pandas` series using the
`pd.Series()` function. For example, to create the series `region` as shown
in {numref}`fig:02-series`, you can write:

```{code-cell} ipython3
import pandas as pd

region = pd.Series(["Toronto", "Montreal", "Vancouver", "Calgary", "Ottawa"])
region
```

**(FIGURE 14 NEEDS UPDATING: (a) ZERO-BASED INDEXING, (b) TYPE SHOULD BE STRING (NOT CHARACTER))**

+++ {"tags": []}

```{figure} img/wrangling/pandas_dataframe_series.png
@@ -200,41 +146,6 @@ region
Example of a `pandas` series whose type is string.
```


```{code-cell} ipython3
:tags: [remove-cell]
@@ -265,76 +176,36 @@ image_read("img/data_frame_slides_cdn/data_frame_slides_cdn.007.jpeg") %>%

```{table} Basic data types in Python
:name: tab:datatype-table

| Data type             | Abbreviation | Description                     | Example                       |
| :-------------------- | :----------- | :------------------------------ | :---------------------------- |
| integer               | `int`        | positive/negative whole numbers | `42`                          |
| floating point number | `float`      | real number in decimal form     | `3.14159`                     |
| boolean               | `bool`       | true or false                   | `True`                        |
| string                | `str`        | text                            | `"Can I have a cheezburger?"` |
| none                  | `NoneType`   | represents no value             | `None`                        |
```

+++

It is important in Python to make sure you represent your data with the correct type.
Many of the `pandas` functions we use in this book treat
the various data types differently. You should use integers and float types
(which both fall under the "numeric" umbrella type) to represent numbers and perform
arithmetic. Strings are used to represent data that should
be thought of as "text", such as words, names, paths, URLs, and more.
There are other basic data types in Python, such as *set*
and *complex*, but we do not use these in this textbook.

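As a small aside (a sketch, not part of the original text), Python's built-in `type` function reports which of these basic types a value has:

```{code-cell} ipython3
# Inspect the basic type of a few example values.
print(type(42))       # <class 'int'>
print(type(3.14159))  # <class 'float'>
print(type(True))     # <class 'bool'>
print(type("text"))   # <class 'str'>
print(type(None))     # <class 'NoneType'>
```
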
### What is a list?

```{index} list
```

Lists are built-in objects in Python that have multiple, ordered elements.
`pandas` series can be treated as arrays with labels (indices).

**(FIGURE 3.4 FROM THE R-BOOK IS MISSING)**

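To make the relationship concrete, here is a small sketch (not from the original text) showing a list and a series built from it:

```{code-cell} ipython3
cities = ["Toronto", "Montreal", "Vancouver"]  # a built-in Python list
cities_series = pd.Series(cities)              # the same values, now with labels
cities_series.index                            # default labels: positions 0, 1, 2
```
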
### What does this have to do with data frames?
340211

@@ -345,10 +216,10 @@ A vector versus a list.

A data frame is really just a collection of series stuck together that follows two rules:

1. Each element itself is a series.
2. Each element (series) must have the same length.

Not all columns in a data frame need to be of the same type.
{numref}`fig:02-dataframe` shows a data frame where
the columns are series of different types.

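As a toy illustration (made-up values, not one of the book's data sets), we can stick equal-length series of different types together into a data frame:

```{code-cell} ipython3
# Three equal-length columns of different types.
df = pd.DataFrame({
    "region": ["Toronto", "Montreal", "Vancouver"],  # strings
    "year": [2016, 2016, 2016],                      # integers
    "population_millions": [2.73, 1.70, 0.63],       # floats (made-up values)
})
df.dtypes
```
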
@@ -361,23 +232,6 @@ the columns are series of different types.
Data frame and vector types.
```

```{index} type
```
@@ -386,49 +240,29 @@ Data frame and vector types.
> For example, we can check the class of the Canadian languages data set,
> `can_lang`, which we worked with in the previous chapters, and we see it is a `pandas.core.frame.DataFrame`.

```{code-cell} ipython3
can_lang = pd.read_csv("data/can_lang.csv")
type(can_lang)
```

Lists, Series and DataFrames are basic types of *data structure* in Python, which
are core to most data analyses. We summarize them in
{numref}`tab:datastructure-table`. There are several other data structures in the Python programming
language (*e.g.,* matrices), but these are beyond the scope of this book.

+++

<!-- ```{table} Basic data structures in Python
:name: tab:datastructure-table

| Data Structure | Description |
| --- |------------ |
| list | A 1D ordered collection of values that can store multiple data types at once. |
| Series | A 1D ordered collection of values *with labels* that can store multiple data types at once. |
| DataFrame | A 2D labeled data structure with columns of potentially different types. |
```

+++ -->

## Tidy data
434268

@@ -437,15 +271,15 @@ language (*e.g.,* matrices), but these are beyond the scope of this book.
437271

438272
There are many ways a tabular data set can be organized. This chapter will focus
439273
on introducing the **tidy data** format of organization and how to make your raw
440-
(and likely messy) data tidy. A tidy data frame satisfies
274+
(and likely messy) data tidy. A tidy data frame satisfies
441275
the following three criteria {cite:p}`wickham2014tidy`:
442276

443277
- each row is a single observation,
444278
- each column is a single variable, and
445279
- each value is a single cell (i.e., its entry in the data
446280
frame is not shared with another value).
447281

448-
{numref}`fig:02-tidy-image` demonstrates a tidy data set that satisfies these
282+
{numref}`fig:02-tidy-image` demonstrates a tidy data set that satisfies these
449283
three criteria.
450284

451285
+++ {"tags": []}
@@ -464,8 +298,8 @@ Tidy data satisfies three criteria.
464298

465299
There are many good reasons for making sure your data are tidy as a first step in your analysis.
466300
The most important is that it is a single, consistent format that nearly every function
467-
in the `pandas` recognizes. No matter what the variables and observations
468-
in your data represent, as long as the data frame
301+
in the `pandas` recognizes. No matter what the variables and observations
302+
in your data represent, as long as the data frame
469303
is tidy, you can manipulate it, plot it, and analyze it using the same tools.
470304
If your data is *not* tidy, you will have to write special bespoke code
471305
in your analysis that will not only be error-prone, but hard for others to understand.
@@ -491,18 +325,18 @@ below!
491325
```{index} pandas.DataFrame; melt
492326
```
493327

494-
One task that is commonly performed to get data into a tidy format
495-
is to combine values that are stored in separate columns,
328+
One task that is commonly performed to get data into a tidy format
329+
is to combine values that are stored in separate columns,
496330
but are really part of the same variable, into one.
497-
Data is often stored this way
498-
because this format is sometimes more intuitive for human readability
331+
Data is often stored this way
332+
because this format is sometimes more intuitive for human readability
499333
and understanding, and humans create data sets.
500-
In {numref}`fig:02-wide-to-long`,
501-
the table on the left is in an untidy, "wide" format because the year values
502-
(2006, 2011, 2016) are stored as column names.
503-
And as a consequence,
504-
the values for population for the various cities
505-
over these years are also split across several columns.
334+
In {numref}`fig:02-wide-to-long`,
335+
the table on the left is in an untidy, "wide" format because the year values
336+
(2006, 2011, 2016) are stored as column names.
337+
And as a consequence,
338+
the values for population for the various cities
339+
over these years are also split across several columns.
506340

507341
For humans, this table is easy to read, which is why you will often find data
508342
stored in this wide format. However, this format is difficult to work with
@@ -518,13 +352,16 @@ greatly simplified once the data is tidied.
518352

519353
Another problem with data in this format is that we don't know what the
520354
numbers under each year actually represent. Do those numbers represent
521-
population size? Land area? It's not clear.
522-
To solve both of these problems,
523-
we can reshape this data set to a tidy data format
355+
population size? Land area? It's not clear.
356+
To solve both of these problems,
357+
we can reshape this data set to a tidy data format
524358
by creating a column called "year" and a column called
525359
"population." This transformation&mdash;which makes the data
526360
"longer"&mdash;is shown as the right table in
527-
{numref}`fig:02-wide-to-long`.
361+
{numref}`fig:02-wide-to-long`. Note that the number of entries in our data frame
362+
can change in this transformation. The "untidy" data has 5 rows and 3 columns for
363+
a total of 15 data, whereas the "tidy" data on the right has 15 rows and 2 columns
364+
for a total of 30 data.
528365

529366
+++ {"tags": []}
530367

@@ -541,41 +378,42 @@ Melting data from a wide to long data format.
```

We can achieve this effect in Python using the `.melt` function from the `pandas` package.
We say that we "melt" (or "pivot") the wide table into a longer format.
The `.melt` function combines columns,
and is usually used during tidying data
when we need to make the data frame longer and narrower.
To learn how to use `.melt`, we will work through an example with the
`region_lang_top5_cities_wide.csv` data set. This data set contains the
counts of how many Canadians cited each language as their mother tongue for five
major Canadian cities (Toronto, Montréal, Vancouver, Calgary and Edmonton) from
the 2016 Canadian census.
To get started,
we will use `pd.read_csv` to load the (untidy) data.

```{code-cell} ipython3
lang_wide = pd.read_csv("data/region_lang_top5_cities_wide.csv")
lang_wide
```
```
559397

560-
What is wrong with the untidy format above?
561-
The table on the left in {numref}`fig:img-pivot-longer-with-table`
398+
What is wrong with the untidy format above?
399+
The table on the left in {numref}`fig:img-pivot-longer-with-table`
562400
represents the data in the "wide" (messy) format.
563-
From a data analysis perspective, this format is not ideal because the values of
564-
the variable *region* (Toronto, Montréal, Vancouver, Calgary and Edmonton)
401+
From a data analysis perspective, this format is not ideal because the values of
402+
the variable *region* (Toronto, Montréal, Vancouver, Calgary and Edmonton)
565403
are stored as column names. Thus they
566404
are not easily accessible to the data analysis functions we will apply
567405
to our data set. Additionally, the *mother tongue* variable values are
568406
spread across multiple columns, which will prevent us from doing any desired
569407
visualization or statistical tasks until we combine them into one column. For
570-
instance, suppose we want to know the languages with the highest number of
408+
instance, suppose we want to know the languages with the highest number of
571409
Canadians reporting it as their mother tongue among all five regions. This
572-
question would be tough to answer with the data in its current format.
573-
We *could* find the answer with the data in this format,
410+
question would be tough to answer with the data in its current format.
411+
We *could* find the answer with the data in this format,
574412
though it would be much easier to answer if we tidy our
575-
data first. If mother tongue were instead stored as one column,
576-
as shown in the tidy data on the right in
413+
data first. If mother tongue were instead stored as one column,
414+
as shown in the tidy data on the right in
577415
{numref}`fig:img-pivot-longer-with-table`,
578-
we could simply use one line of code (`df["mother_tongue"].max()`)
416+
we could simply use one line of code (`df["mother_tongue"].max()`)
579417
to get the maximum value.
580418

581419
+++ {"tags": []}
@@ -589,7 +427,7 @@ Going from wide to long with the `.melt` function.

+++

{numref}`fig:img-pivot-longer` details the arguments that we need to specify
in the `.melt` function to accomplish this data transformation.

+++ {"tags": []}
@@ -613,25 +451,26 @@ We use `.melt` to combine the Toronto, Montréal,
Vancouver, Calgary, and Edmonton columns into a single column called `region`,
and create a column called `mother_tongue` that contains the count of how many
Canadians report each language as their mother tongue for each metropolitan
area:

```{code-cell} ipython3
lang_mother_tidy = lang_wide.melt(
    id_vars=["category", "language"],
    var_name="region",
    value_name="mother_tongue",
)

lang_mother_tidy
```

**(FIGURE 3.9 FROM THE R BOOK IS MISSING)**

> **Note**: In the code above, the call to the
> `.melt` function is split across several lines. This is allowed in
> certain cases; for example, when calling a function as above, the input
> arguments are between parentheses `()` and Python knows to keep reading on
> the next line. Each line ends with a comma `,` making it easier to read.
> Splitting long lines like this across multiple lines is encouraged
> as it helps significantly with code readability. Generally speaking, you should
> limit each line of code to about 80 characters.
@@ -656,17 +495,17 @@ been met:
Suppose we have observations spread across multiple rows rather than in a single
row. For example, in {numref}`fig:long-to-wide`, the table on the left is in an
untidy, long format because the `count` column contains three variables
(population, commuter, and incorporated count) and information about each observation
(here, population, commuter, and incorporated counts for a region) is split across three rows.
Remember: one of the criteria for tidy data
is that each observation must be in a single row.

Using data in this format&mdash;where two or more variables are mixed together
in a single column&mdash;makes it harder to apply many usual `pandas` functions.
For example, finding the maximum number of commuters
would require an additional step of filtering for the commuter values
before the maximum can be computed.
In comparison, if the data were tidy,
all we would have to do is compute the maximum value for the commuter column.
To reshape this untidy data set to a tidy (and in this case, wider) format,
we need to create columns called "population", "commuters", and "incorporated."
@@ -684,12 +523,12 @@ Going from long to wide data.
+++

To tidy this type of data in Python, we can use the `.pivot` function.
The `.pivot` function generally increases the number of columns (widens)
and decreases the number of rows in a data set.
To learn how to use `.pivot`,
we will work through an example
with the `region_lang_top5_cities_long.csv` data set.
This data set contains the number of Canadians reporting
the primary language at home and work for five
major cities (Toronto, Montréal, Vancouver, Calgary and Edmonton).

@@ -698,14 +537,14 @@ lang_long = pd.read_csv("data/region_lang_top5_cities_long.csv")
lang_long
```

What makes the data set shown above untidy?
In this example, each observation is a language in a region.
However, each observation is split across multiple rows:
one where the count for `most_at_home` is recorded,
and the other where the count for `most_at_work` is recorded.
Suppose the goal with this data was to
visualize the relationship between the number of
Canadians reporting their primary language at home and work.
Doing that would be difficult with this data in its current form,
since these two variables are stored in the same column.
{numref}`fig:img-pivot-wider-table` shows how this data
@@ -722,7 +561,7 @@ Going from long to wide with the `.pivot` function.

+++

{numref}`fig:img-pivot-wider` details the arguments that we need to specify
in the `.pivot` function.

+++ {"tags": []}
@@ -754,7 +593,7 @@ lang_home_tidy
```

```{code-cell} ipython3
lang_home_tidy.info()
```

The data above is now tidy! We can go through the three criteria again to check
@@ -781,11 +620,11 @@ more columns, and we would see the data set "widen."

```{index} pandas.Series; str.split, delimiter
```

Data are also not considered tidy when multiple values are stored in the same
cell. The data set we show below is even messier than the ones we dealt with
above: the `Toronto`, `Montréal`, `Vancouver`, `Calgary` and `Edmonton` columns
contain the number of Canadians reporting their primary language at home and
work in one column separated by the delimiter (`/`). The column names are the
values of a variable, *and* each value does not have its own cell! To turn this
messy data into tidy data, we'll have to fix these issues.

@@ -795,28 +634,34 @@ lang_messy
```

First we’ll use `.melt` to create two columns, `region` and `value`,
similar to what we did previously.
The new `region` column will contain the region names,
and the new column `value` will be a temporary holding place for the
data that we need to further separate, i.e., the
number of Canadians reporting their primary language at home and work.

```{code-cell} ipython3
lang_messy_longer = lang_messy.melt(
    id_vars=["category", "language"],
    var_name="region",
    value_name="value",
)

lang_messy_longer
```

Next we'll use `.str.split` to split the `value` column into two columns.
It works by taking a single string and splitting it into multiple values
based on the character you tell it to split on. For example:

```{code-cell} ipython3
"50/0".split("/")
```

We can use this operation on the columns of our data frame so that
one column will contain only the counts of Canadians
that speak each language most at home,
and the other will contain the counts of Canadians
that speak each language most at work for each region.
{numref}`fig:img-separate`
outlines what we need to specify to use `.str.split`.

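The application to our data frame is elided by the diff below, but a minimal sketch of the idea (the `expand=True` argument and the intermediate name `split_counts` are assumptions, not the book's exact cell) looks like:

```{code-cell} ipython3
# A sketch: split the "value" strings on "/" into two separate columns.
split_counts = lang_messy_longer["value"].str.split("/", expand=True)
split_counts.columns = ["most_at_home", "most_at_work"]
split_counts
```
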
@@ -843,7 +688,7 @@ tidy_lang
```

```{code-cell} ipython3
tidy_lang.info()
```

Is this data set now tidy? If we recall the three criteria for tidy data:
@@ -856,17 +701,17 @@ We can see that this data now satisfies all three criteria, making it easier to
analyze. But we aren't done yet! Notice in the table, all of the variables are
"object" data types. Object data types are columns of strings or columns with mixed types. In the previous example in Section {ref}`pivot-wider`, the
`most_at_home` and `most_at_work` variables were `int64` (integer)&mdash;you can
verify this by calling `df.info()`&mdash;which is a type
of numeric data. This change is due to the delimiter (`/`) when we read in this
messy data set. Python read these columns in as string types, and by default,
`.str.split` will return columns as object data types.

It makes sense for `region`, `category`, and `language` to be stored as an
object type. However, suppose we want to apply any functions that treat the
`most_at_home` and `most_at_work` columns as a number (e.g., finding rows
above a numeric threshold of a column).
In that case,
it won't be possible if the variable is stored as an `object`.
Fortunately, the `pandas.to_numeric` function provides a natural way to fix problems
like this: it will convert the columns to the best numeric data types.

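As a quick, made-up illustration of what `pd.to_numeric` does (not one of the book's data sets):

```{code-cell} ipython3
# A made-up series of strings; to_numeric picks the best numeric dtype (int64 here).
pd.to_numeric(pd.Series(["50", "0", "25"]))
```
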
@@ -887,12 +732,12 @@ like this: it will convert the columns to the best numeric data types.
887732
888733
# It makes sense for `region`, `category`, and `language` to be stored as a
889734
# character (or perhaps factor) type. However, suppose we want to apply any functions that treat the
890-
# `most_at_home` and `most_at_work` columns as a number (e.g., finding rows
891-
# above a numeric threshold of a column).
892-
# In that case,
893-
# it won't be possible to do if the variable is stored as a `character`.
735+
# `most_at_home` and `most_at_work` columns as a number (e.g., finding rows
736+
# above a numeric threshold of a column).
737+
# In that case,
738+
# it won't be possible to do if the variable is stored as a `character`.
894739
# Fortunately, the `separate` function provides a natural way to fix problems
895-
# like this: we can set `convert = TRUE` to convert the `most_at_home`
740+
# like this: we can set `convert = TRUE` to convert the `most_at_home`
896741
# and `most_at_work` columns to the correct data type.
897742
```
898743

@@ -903,126 +748,38 @@ tidy_lang
```

```{code-cell} ipython3
tidy_lang.info()
```

Now we see that the `most_at_home` and `most_at_work` columns are of `int64` data types,
indicating they are integer data types (i.e., numbers)!

+++

974760

975-
```{code-cell} ipython3
976-
tidy_lang.loc[:, tidy_lang.columns.str.startswith('most')]
977-
```
978-
979-
```{index} pandas.Series; str.contains
980-
```
981-
982-
We could also have chosen the columns containing an underscore `_` by using the
983-
`.str.contains("_")`, since we notice
984-
the columns we want contain underscores and the others don't.
985-
986-
```{code-cell} ipython3
987-
tidy_lang.loc[:, tidy_lang.columns.str.contains('_')]
988-
```
989-
990-
There are many different functions that help with selecting
991-
variables based on certain criteria.
992-
The additional resources section at the end of this chapter
993-
provides a comprehensive resource on these functions.
994-
995-
```{code-cell} ipython3
996-
:tags: [remove-cell]
997-
998-
# There are many different `select` helpers that select
999-
# variables based on certain criteria.
1000-
# The additional resources section at the end of this chapter
1001-
# provides a comprehensive resource on `select` helpers.
1002-
```
1003-
1004-
## Using `df[]` to extract rows
1005-
1006-
Next, we revisit the `df[]` from Chapter {ref}`intro`,
1007-
which lets us create a subset of rows from a data frame.
1008-
Recall the argument to the `df[]`:
761+
Now that the `tidy_lang` data is indeed *tidy*, we can start manipulating it
762+
using the powerful suite of functions from the `pandas`.
763+
We revisit the `[]` from Chapter {ref}`intro`,
764+
which lets us create a subset of rows from a data frame.
765+
Recall the argument to the `[]`:
1009766
column names or a logical statement evaluated to either `True` or `False`;
1010-
`df[]` works by returning the rows where the logical statement evaluates to `True`.
1011-
This section will highlight more advanced usage of the `df[]` function.
767+
`[]` works by returning the rows where the logical statement evaluates to `True`.
768+
This section will highlight more advanced usage of the `[]` function.
1012769
In particular, this section provides an in-depth treatment of the variety of logical statements
1013-
one can use in the `df[]` to select subsets of rows.
770+
one can use in the `[]` to select subsets of rows.
1014771

1015772
+++
1016773

1017774
### Extracting rows that have a certain value with `==`
1018775
Suppose we are only interested in the subset of rows in `tidy_lang` corresponding to the
1019776
official languages of Canada (English and French).
1020-
We can extract these rows by using the *equivalency operator* (`==`)
1021-
to compare the values of the `category` column
1022-
with the value `"Official languages"`.
1023-
With these arguments, `df[]` returns a data frame with all the columns
1024-
of the input data frame
1025-
but only the rows we asked for in the logical statement, i.e.,
777+
We can extract these rows by using the *equivalency operator* (`==`)
778+
to compare the values of the `category` column
779+
with the value `"Official languages"`.
780+
With these arguments, `[]` returns a data frame with all the columns
781+
of the input data frame
782+
but only the rows we asked for in the logical statement, i.e.,
1026783
those where the `category` column holds the value `"Official languages"`.
1027784
We name this data frame `official_langs`.
1028785

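The filtering cell itself is elided in this diff hunk; a minimal sketch consistent with the prose above would be:

```{code-cell} ipython3
# A sketch: keep only the rows whose category is exactly "Official languages".
official_langs = tidy_lang[tidy_lang["category"] == "Official languages"]
official_langs
```
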
@@ -1034,7 +791,7 @@ official_langs
### Extracting rows that do not have a certain value with `!=`

What if we want all the other language categories in the data set *except* for
those in the `"Official languages"` category? We can accomplish this with the `!=`
operator, which means "not equal to". So if we want to find all the rows
where the `category` does *not* equal `"Official languages"` we write the code
below.
@@ -1046,14 +803,14 @@ tidy_lang[tidy_lang["category"] != "Official languages"]
(filter-and)=
### Extracting rows satisfying multiple conditions using `&`

Suppose now we want to look at only the rows
for the French language in Montréal.
To do this, we need to filter the data set
to find rows that satisfy multiple conditions simultaneously.
We can do this with the ampersand symbol (`&`), which
is interpreted by Python as "and".
We write the code as shown below to filter the `official_langs` data frame
to subset the rows where `region == "Montréal"`
*and* the `language == "French"`.

```{code-cell} ipython3
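# The cell body is elided by this diff hunk; a sketch completing the
# truncated expression in the hunk header, using the conditions from
# the prose above:
tidy_lang[(tidy_lang["region"] == "Montréal") & (tidy_lang["language"] == "French")]
```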
@@ -1065,12 +822,12 @@ tidy_lang[(tidy_lang["region"] == "Montréal") & (tidy_lang["language"] == "Fren
### Extracting rows satisfying at least one condition using `|`

Suppose we were interested in only those rows corresponding to cities in Alberta
in the `official_langs` data set (Edmonton and Calgary).
We can't use `&` as we did above because `region`
cannot be both Edmonton *and* Calgary simultaneously.
Instead, we can use the vertical pipe (`|`) logical operator,
which gives us the cases where one condition *or*
another condition *or* both are satisfied.
In the code below, we ask Python to return the rows
where the `region` column is equal to "Calgary" *or* "Edmonton".

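The full cell is elided here (the hunk header below shows only `official_langs[`); a sketch of the expression it describes:

```{code-cell} ipython3
# A sketch: rows where region is "Calgary" or "Edmonton".
official_langs[
    (official_langs["region"] == "Calgary") | (official_langs["region"] == "Edmonton")
]
```
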
@@ -1082,20 +839,20 @@ official_langs[

### Extracting rows with values in a list using `.isin()`

Next, suppose we want to see the populations of our five cities.
Let's read in the `region_data.csv` file
that comes from the 2016 Canadian census,
as it contains statistics for number of households, land area, population
and number of dwellings for different regions.

```{code-cell} ipython3
region_data = pd.read_csv("data/region_data.csv")
region_data
```

To get the population of the five cities
we can filter the data set using the `.isin` method.
The `.isin` method is used to see if an element belongs to a list.
Here we are filtering for rows where the value in the `region` column
matches any of the five cities we are interested in: Toronto, Montréal,
Vancouver, Calgary, and Edmonton.
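The `.isin` cell is elided in this hunk; a minimal sketch (the `city_names` and `five_cities` names are assumptions) would be:

```{code-cell} ipython3
# A sketch: keep rows whose region is one of the five cities of interest.
city_names = ["Toronto", "Montréal", "Vancouver", "Calgary", "Edmonton"]
five_cities = region_data[region_data["region"].isin(city_names)]
five_cities
```
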
@@ -1136,7 +893,7 @@ pd.Series(["Vancouver", "Toronto"]).isin(pd.Series(["Toronto", "Vancouver"]))
# > elements in `vectorB`. Then the second element of `vectorA` is compared
# > to all the elements in `vectorB`, and so on. Notice the difference between `==` and
# > `%in%` in the example below.
# >
# >``` {r}
# >c("Vancouver", "Toronto") == c("Toronto", "Vancouver")
# >c("Vancouver", "Toronto") %in% c("Toronto", "Vancouver")
@@ -1152,25 +909,135 @@ glue("census_popn", "{0:,.0f}".format(35151728))
glue("most_french", "{0:,.0f}".format(2669195))
```

We saw in Section {ref}`filter-and` that
{glue:text}`most_french` people reported
speaking French in Montréal as their primary language at home.
If we are interested in finding the official languages in regions
with higher numbers of people who speak it as their primary language at home
compared to French in Montréal, then we can use `[]` to obtain rows
where the value of `most_at_home` is greater than
{glue:text}`most_french`.

```{code-cell} ipython3
official_langs[official_langs["most_at_home"] > 2669195]
```

This operation returns a data frame with only one row, indicating that when
considering the official languages,
only English in Toronto is reported by more people
as their primary language at home
than French in Montréal according to the 2016 Canadian census.

(loc-iloc)=
## Using `.loc[]` to filter rows and select columns

```{index} pandas.DataFrame; loc[]
```

The `[]` operation is only used when you want to filter rows or select columns;
it cannot be used to do both operations at the same time. This is where `.loc[]`
comes in. For the first example, recall `.loc[]` from Chapter {ref}`intro`,
which lets us create a subset of columns from a data frame.
Suppose we wanted to select only the columns `language`, `region`,
`most_at_home` and `most_at_work` from the `tidy_lang` data set. Using what we
learned in Chapter {ref}`intro`, we would pass all of these column names into the square brackets:

```{code-cell} ipython3
selected_columns = tidy_lang.loc[:, ["language", "region", "most_at_home", "most_at_work"]]
selected_columns
```

We pass `:` before the comma indicating we want to retrieve all rows, and the list indicates
the columns that we want.

Note that we could obtain the same result by stating that we would like all of the columns
from `language` through `most_at_work`. Instead of passing a list of all of the column
names that we want, we can ask for the range of columns `"language":"most_at_work"`, which
you can read as "the columns from `language` to (`:`) `most_at_work`".

```{code-cell} ipython3
selected_columns = tidy_lang.loc[:, "language":"most_at_work"]
selected_columns
```

Similarly, you can ask for all of the columns including and after `language` by doing the following:

```{code-cell} ipython3
selected_columns = tidy_lang.loc[:, "language":]
selected_columns
```

By not putting anything after the `:`, Python reads this as "from `language` until the last column".
Although the notation for selecting a range using `:` is convenient because less code is required,
it must be used carefully. If you were to re-order columns or add a column to the data frame, the
output would change. Using a list is more explicit and less prone to potential confusion.

Suppose instead we wanted to extract columns that followed a particular pattern
rather than just selecting a range. For example, let's say we wanted only to select the
columns `most_at_home` and `most_at_work`. There are other functions that allow
us to select variables based on their names. In particular, we can use the `.str.startswith` method
to choose only the columns that start with the word "most":

```{code-cell} ipython3
tidy_lang.loc[:, tidy_lang.columns.str.startswith('most')]
```

```{index} pandas.Series; str.contains
```

We could also have chosen the columns containing an underscore `_` by using the
`.str.contains("_")`, since we notice
the columns we want contain underscores and the others don't.

```{code-cell} ipython3
tidy_lang.loc[:, tidy_lang.columns.str.contains('_')]
```

There are many different functions that help with selecting
variables based on certain criteria.
The additional resources section at the end of this chapter
provides a comprehensive resource on these functions.

```{code-cell} ipython3
:tags: [remove-cell]

# There are many different `select` helpers that select
# variables based on certain criteria.
# The additional resources section at the end of this chapter
# provides a comprehensive resource on `select` helpers.
```

## Using `.iloc[]` to extract a range of columns
```{index} pandas.DataFrame; iloc[], column range
```
Another approach for selecting columns is to use `.iloc[]`,
which allows us to index with integers rather than the names of the columns.
For example, the column names of the `tidy_lang` data frame are
`['category', 'language', 'region', 'most_at_home', 'most_at_work']`.

Then using `.iloc[]` you can ask for the `language` column by doing:

```{code-cell} ipython3
column = tidy_lang.iloc[:, 1]
column
```

You can also ask for multiple columns as we did with `[]`. We pass `:` before
the comma indicating we want to retrieve all rows, and `1:` after the comma
indicating we want only columns from index 1 (*i.e.* `language`) and afterwards.

```{code-cell} ipython3
column_range = tidy_lang.iloc[:, 1:]
column_range
```

This is less commonly used and needs to be used with care; it is easy to
accidentally put in the wrong integer because you didn't remember if `language`
was column number 1 or 2.

Notice that we get the same output as we did above with `.loc[]`.

```{index} pandas.Series; str.startswith
```

+++ {"tags": []}

(pandas-assign)=
@@ -1180,28 +1047,27 @@ than French in Montréal according to the 2016 Canadian census.

### Using `.assign` to modify columns

```{index} pandas.DataFrame; []
```

In Section {ref}`str-split`,
when we first read in the `"region_lang_top5_cities_messy.csv"` data,
all of the variables were "object" data types.
During the tidying process,
we used the `pandas.to_numeric` function
to convert the `most_at_home` and `most_at_work` columns
to the desired integer (i.e., numeric class) data types and then used `[]` to overwrite columns.
But suppose we didn't use `[]`,
and needed to modify the columns some other way.
Below we create such a situation
so that we can demonstrate how to use `.assign`
to change the column types of a data frame.
`.assign` is a useful function to modify or create new data frame columns.

```{code-cell} ipython3
lang_messy = pd.read_csv("data/region_lang_top5_cities_messy.csv")
lang_messy_longer = lang_messy.melt(
    id_vars=["category", "language"],
    var_name="region",
    value_name="value",
)
@@ -1219,23 +1085,23 @@ official_langs_obj
```

```{code-cell} ipython3
official_langs_obj.info()
```

To use the `.assign` method, again we first specify the object to be the data set,
and in the following arguments,
we specify the name of the column we want to modify or create
(here `most_at_home` and `most_at_work`), an `=` sign,
and then the function we want to apply (here `pandas.to_numeric`).
In the function we want to apply,
we refer to the column upon which we want it to act
(here `most_at_home` and `most_at_work`).
In our example, we are naming the columns the same
names as columns that already exist in the data frame
("most\_at\_home", "most\_at\_work")
and this will cause `.assign` to *overwrite* those columns
(also referred to as modifying those columns *in-place*).
If we were to give the columns a new name,
then `.assign` would create new columns with the names we specified.
`.assign`'s general syntax is detailed in {numref}`fig:img-assign`.

+++

Below we use `.assign` to convert the columns `most_at_home` and `most_at_work`
to numeric data types in the `official_langs` data set as described in
{numref}`fig:img-assign`:

```{code-cell} ipython3
official_langs_numeric = official_langs_obj.assign(
    most_at_home=pd.to_numeric(official_langs_obj["most_at_home"]),
    most_at_work=pd.to_numeric(official_langs_obj["most_at_work"]),
)
official_langs_numeric
```

```{code-cell} ipython3
official_langs_numeric.info()
```

Now we see that the `most_at_home` and `most_at_work` columns are both `int64` (which is a numeric data type)!
the 2016 Canadian census. What does this number mean to us? To understand this
number, we need context. In particular, how many people were in Toronto when
this data was collected? From the 2016 Canadian census profile, the population
of Toronto was reported to be
{glue:text}`toronto_popn` people.
The number of people who report that English is their primary language at home
is much more meaningful when we report it in this context.
We can even go a step further and transform this count to a relative frequency
or proportion.
We can do this by dividing the number of people reporting a given language
as their primary language at home by the number of people who live in Toronto.
For example,
the proportion of people who reported that their primary language at home
was English in the 2016 Canadian census was {glue:text}`prop_eng_tor`
in Toronto.

Let's use `.assign` to create a new column in our data frame
that holds the proportion of people who speak English
for our five cities of focus in this chapter.
To accomplish this, we will need to do two tasks
beforehand:

1. Create a list containing the population values for the cities.
2. Filter the `official_langs` data frame
so that we only keep the rows where the language is English.

To create a list containing the population values for the five cities
(Toronto, Montréal, Vancouver, Calgary, and Edmonton),
we write:

```{code-cell} ipython3
city_pops = [5928040, 4098927, 2463431, 1392609, 1321426]
city_pops
```

Next, we will filter the `official_langs` data frame
so that we only keep the rows where the language is English.
We will name the new data frame we get from this `english_langs`:

```{code-cell} ipython3
english_langs = official_langs[official_langs["language"] == "English"]
english_langs
```

Finally, we can use `.assign` to create a new column,
named `most_at_home_proportion`, whose value corresponds to
the proportion of people reporting English as their primary
language at home.
We will compute this by dividing the `most_at_home` column by our list of city populations.
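
A minimal sketch of that computation (using the `english_langs` and
`city_pops` objects defined above) could look like this:

```{code-cell} ipython3
english_langs = english_langs.assign(
    most_at_home_proportion=english_langs["most_at_home"] / city_pops
)
english_langs
```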

In the computation above, we had to ensure that we ordered the `city_pops` list in the
same order as the cities were listed in the `english_langs` data frame.
This is because Python performs the division elementwise, dividing
each element of the `most_at_home` column by the corresponding element of the
`city_pops` list, matching them up by position.
Failing to do this would have resulted in incorrect proportions.

> **Note:** In more advanced data wrangling,
> one might solve this problem in a less error-prone way through using
> a technique called "joins".
> We link to resources that discuss this in the additional
> resources at the end of this chapter.
<!--
#### Creating a visualization with tidy data {-}

Now that we have cleaned and wrangled the data, we can make visualizations or do
Now that we have cleaned and wrangled the data, we can make visualizations or do
13731239
statistical analyses to answer questions about it! Let's suppose we want to
1374-
answer the question "what proportion of people in each city speak English
1240+
answer the question "what proportion of people in each city speak English
13751241
as their primary language at home in these five cities?" Since the data is
13761242
cleaned already, in a few short lines of code, we can use `ggplot` to create a
13771243
data visualization to answer this question! Here we create a bar plot to represent the proportions for
13781244
each region and color the proportions by language.
13791245
1380-
> Don't worry too much about the code to make this plot for now. We will cover
1246+
> Don't worry too much about the code to make this plot for now. We will cover
13811247
> visualizations in detail in Chapter \@ref(viz).
13821248
13831249
```{r 02-plot, out.width = "100%", fig.cap = "Bar plot of proportions of Canadians reporting English as the most often spoken language at home."}
13841250
ggplot(english_langs,
13851251
aes(
1386-
x = region,
1252+
x = region,
13871253
y = most_at_home_proportion
13881254
)
13891255
) +
@@ -1413,7 +1279,7 @@ frame called `data`:
14131279
2) filter for rows where another column, `other_col`, is more than 5, and
14141280
3) select only the new column `new_col` for those rows.
14151281

1416-
One way of performing these three steps is to just write
1282+
One way of performing these three steps is to just write
14171283
multiple lines of code, storing temporary objects as you go:
14181284

14191285
```{code-cell} ipython3
@@ -1450,7 +1316,7 @@ each subsequent line.
14501316
+++
14511317

14521318
Chaining the sequential functions solves this problem, resulting in cleaner and
1453-
easier-to-follow code.
1319+
easier-to-follow code.
14541320
The code below accomplishes the same thing as the previous
14551321
two code blocks:
14561322

@@ -1468,8 +1334,8 @@ output = (
14681334
:tags: [remove-cell]
14691335
14701336
# ``` {r eval = F}
1471-
# output <- select(filter(mutate(data, new_col = old_col * 2),
1472-
# other_col > 5),
1337+
# output <- select(filter(mutate(data, new_col = old_col * 2),
1338+
# other_col > 5),
14731339
# new_col)
14741340
# ```
14751341
# Code like this can also be difficult to understand. Functions compose (reading
@@ -1479,10 +1345,10 @@ output = (
14791345
14801346
# The *pipe operator* (`|>`) solves this problem, resulting in cleaner and
# easier-to-follow code. `|>` is built into R so you don't need to load any
# packages to use it.
# You can think of the pipe as a physical pipe. It takes the output from the
# function on the left-hand side of the pipe, and passes it as the first argument
# to the function on the right-hand side of the pipe.
# The code below accomplishes the same thing as the previous
14871353
# two code blocks:
14881354
```

> **Note:** Notice that we split the chained function calls across multiple
> lines, similar to when we did this earlier in the chapter
> for long function calls. Again, this is allowed and recommended, especially when
> the chained function calls create a long line of code. Doing this makes
> your code more readable. When you do this, it is important to use parentheses
> to tell Python that your code is continuing onto the next line.

```{code-cell} ipython3
@@ -1507,28 +1373,28 @@ output = (
15071373
15081374
# > **Note:** In this textbook, we will be using the base R pipe operator syntax, `|>`.
15091375
# > This base R `|>` pipe operator was inspired by a previous version of the pipe
1510-
# > operator, `%>%`. The `%>%` pipe operator is not built into R
1511-
# > and is from the `magrittr` R package.
1512-
# > The `tidyverse` metapackage imports the `%>%` pipe operator via `dplyr`
1376+
# > operator, `%>%`. The `%>%` pipe operator is not built into R
1377+
# > and is from the `magrittr` R package.
1378+
# > The `tidyverse` metapackage imports the `%>%` pipe operator via `dplyr`
15131379
# > (which in turn imports the `magrittr` R package).
1514-
# > There are some other differences between `%>%` and `|>` related to
1515-
# > more advanced R uses, such as sharing and distributing code as R packages,
1516-
# > however, these are beyond the scope of this textbook.
1380+
# > There are some other differences between `%>%` and `|>` related to
1381+
# > more advanced R uses, such as sharing and distributing code as R packages,
1382+
# > however, these are beyond the scope of this textbook.
15171383
# > We have this note in the book to make the reader aware that `%>%` exists
1518-
# > as it is still commonly used in data analysis code and in many data science
1384+
# > as it is still commonly used in data analysis code and in many data science
15191385
# > books and other resources.
15201386
# > In most cases these two pipes are interchangeable and either can be used.
15211387
15221388
# \index{pipe}\index{aaapipesymbb@\%>\%|see{pipe}}
15231389
```
15241390

1525-
### Chaining `df[]` and `.loc`
1391+
### Chaining `[]` and `.loc`
15261392

15271393
+++
15281394

1529-
Let's work with the tidy `tidy_lang` data set from Section {ref}`str-split`,
1530-
which contains the number of Canadians reporting their primary language at home
1531-
and work for five major cities
1395+
Let's work with the tidy `tidy_lang` data set from Section {ref}`str-split`,
1396+
which contains the number of Canadians reporting their primary language at home
1397+
and work for five major cities
15321398
(Toronto, Montréal, Vancouver, Calgary, and Edmonton):
15331399

15341400
```{code-cell} ipython3
@@ -1537,7 +1403,7 @@ tidy_lang
15371403

15381404
Suppose we want to create a subset of the data with only the languages and
15391405
counts of each language spoken most at home for the city of Vancouver. To do
1540-
this, we can use the `df[]` and `.loc`. First, we use `df[]` to
1406+
this, we can use the `[]` and `.loc`. First, we use `[]` to
15411407
create a data frame called `van_data` that contains only values for Vancouver.
15421408

15431409
```{code-cell} ipython3
@@ -1554,8 +1420,8 @@ van_data_selected
15541420

15551421
Although this is valid code, there is a more readable approach we could take by
15561422
chaining the operations. With chaining, we do not need to create an intermediate
1557-
object to store the output from `df[]`. Instead, we can directly call `.loc` upon the
1558-
output of `df[]`:
1423+
object to store the output from `[]`. Instead, we can directly call `.loc` upon the
1424+
output of `[]`:
15591425

15601426
```{code-cell} ipython3
15611427
van_data_selected = tidy_lang[tidy_lang["region"] == "Vancouver"].loc[
@@ -1568,12 +1434,12 @@ van_data_selected
15681434
```{code-cell} ipython3
15691435
:tags: [remove-cell]
15701436

# But wait...Why do the `select` and `filter` function calls
# look different in these two examples?
# Remember: when you use the pipe,
# the output of the first function is automatically provided
# as the first argument for the function that comes after it.
# Therefore you do not specify the first argument in that function call.
# In the code above,
# the first line is just the `tidy_lang` data frame with a pipe.
# The pipe passes the left-hand side (`tidy_lang`) to the first argument of the function on the right (`filter`),
```

Both approaches give us the same output, but the chained
approach is clearer and more readable.
+++

Chaining can be used with any method in Python.
Additionally, we can chain together more than two functions.
For example, we can chain together three functions to:

- extract rows (`[]`) to include only those where the counts of the language most spoken at home are greater than 10,000,
- extract only the columns (`.loc`) corresponding to `region`, `language` and `most_at_home`, and
- sort the data frame rows in order (`.sort_values`) by counts of the language most spoken at home
from smallest to largest.

```{index} pandas.DataFrame; sort_values
```

As we saw in Chapter {ref}`intro`,
we can use the `.sort_values` function
to order the rows in the data frame by the values of one or more columns.
Here we pass the column name `most_at_home` to sort the data frame rows by the values in that column, in ascending order.
```{code-cell} ipython3
# a sketch of the three chained steps listed above
large_region_lang = (
    tidy_lang[tidy_lang["most_at_home"] > 10000]
    .loc[:, ["region", "language", "most_at_home"]]
    .sort_values(by="most_at_home")
)
large_region_lang
# using it as the first argument of the first function. These two choices are equivalent,
# and we get the same result.
# ``` {r}
# large_region_lang <- tidy_lang |>
# filter(most_at_home > 10000) |>
# select(region, language, most_at_home) |>
# arrange(most_at_home)
# ```
```

Now that we've shown you chaining as an alternative to storing
temporary objects and composing code, does this mean you should *never* store
temporary objects or compose code? Not necessarily!
There are times when you will still want to do these things.
For example, you might store a temporary object before feeding it into a plot function
so you can iteratively change the plot without having to
redo all of your data transformations.
Additionally, chaining many functions can be overwhelming and difficult to debug;
you may want to store a temporary object midway through to inspect your result
before moving on with further steps.
As a part of many data analyses, we need to calculate a summary value for the
data (a *summary statistic*).
Examples of summary statistics we might want to calculate
are the number of observations, the average/mean value for a column,
the minimum value, etc.
Oftentimes,
this summary statistic is calculated from the values in a data frame column,
or columns, as shown in {numref}`fig:summarize`.

+++ {"tags": []}

```{code-cell} ipython3
:tags: [remove-cell]

# A useful `dplyr` function for calculating summary statistics is `summarize`,
1553+
# A useful `dplyr` function for calculating summary statistics is `summarize`,
16881554
# where the first argument is the data frame and subsequent arguments
1689-
# are the summaries we want to perform.
1690-
# Here we show how to use the `summarize` function to calculate the minimum
1691-
# and maximum number of Canadians
1555+
# are the summaries we want to perform.
1556+
# Here we show how to use the `summarize` function to calculate the minimum
1557+
# and maximum number of Canadians
16921558
# reporting a particular language as their primary language at home.
16931559
# First a reminder of what `region_lang` looks like:
16941560
```
@@ -1698,9 +1564,9 @@ region_lang = pd.read_csv("data/region_lang.csv")
16981564
region_lang
16991565
```
17001566


We apply `min` to calculate the minimum
and `max` to calculate the maximum number of Canadians
reporting a particular language as their primary language at home,
for any region, and use `.assign` to attach a column name to each:

```{code-cell} ipython3
lang_summary = pd.DataFrame()
lang_summary = lang_summary.assign(
    min_most_at_home=[min(region_lang["most_at_home"])],
    max_most_at_home=[max(region_lang["most_at_home"])],
)  # column names here are illustrative; the original cell was elided
lang_summary
```

```{index} see: NaN; missing data
```
17461612

1747-
In `pandas` DataFrame, the value `NaN` is often used to denote missing data.
1748-
Many of the base python statistical summary functions
1749-
(e.g., `max`, `min`, `sum`, etc) will return `NaN`
1750-
when applied to columns containing `NaN` values.
1751-
Usually that is not what we want to happen;
1613+
In `pandas` DataFrame, the value `NaN` is often used to denote missing data.
1614+
Many of the base python statistical summary functions
1615+
(e.g., `max`, `min`, `sum`, etc) will return `NaN`
1616+
when applied to columns containing `NaN` values.
1617+
Usually that is not what we want to happen;
17521618
instead, we would usually like Python to ignore the missing entries
17531619
and calculate the summary statistic using all of the other non-`NaN` values
17541620
in the column.
1755-
Fortunately `pandas` provides many equivalent methods (e.g., `.max`, `.min`, `.sum`, etc) to
1621+
Fortunately `pandas` provides many equivalent methods (e.g., `.max`, `.min`, `.sum`, etc) to
17561622
these summary functions while providing an extra argument `skipna` that lets
17571623
us tell the function what to do when it encounters `NaN` values.
17581624
In particular, if we specify `skipna=True` (default), the function will ignore
17591625
missing values and return a summary of all the non-missing entries.
17601626
We show an example of this below.
17611627

17621628
First we create a new version of the `region_lang` data frame,
1763-
named `region_lang_na`, that has a seemingly innocuous `NaN`
1629+
named `region_lang_na`, that has a seemingly innocuous `NaN`
17641630
in the first row of the `most_at_home` column:
17651631

17661632
```{code-cell} ipython3
17671633
:tags: [remove-cell]
17681634
1769-
# In data frames in R, the value `NA` is often used to denote missing data.
1770-
# Many of the base R statistical summary functions
1771-
# (e.g., `max`, `min`, `mean`, `sum`, etc) will return `NA`
1635+
# In data frames in R, the value `NA` is often used to denote missing data.
1636+
# Many of the base R statistical summary functions
1637+
# (e.g., `max`, `min`, `mean`, `sum`, etc) will return `NA`
17721638
# when applied to columns containing `NA` values. \index{missing data}\index{NA|see{missing data}}
1773-
# Usually that is not what we want to happen;
1639+
# Usually that is not what we want to happen;
17741640
# instead, we would usually like R to ignore the missing entries
17751641
# and calculate the summary statistic using all of the other non-`NA` values
17761642
# in the column.
@@ -1792,8 +1658,8 @@ region_lang_na.loc[0, "most_at_home"] = np.nan
17921658
region_lang_na
17931659
```
17941660

1795-
Now if we apply the Python built-in summary function as above,
1796-
we see that we no longer get the minimum and maximum returned,
1661+
Now if we apply the Python built-in summary function as above,
1662+
we see that we no longer get the minimum and maximum returned,
17971663
but just an `NaN` instead!
17981664

17991665
```{code-cell} ipython3
@@ -1827,21 +1693,21 @@ lang_summary_na
18271693
```{index} pandas.DataFrame; groupby
18281694
```
18291695

1830-
A common pairing with summary functions is `.groupby`. Pairing these functions
1696+
A common pairing with summary functions is `.groupby`. Pairing these functions
18311697
together can let you summarize values for subgroups within a data set,
1832-
as illustrated in {numref}`fig:summarize-groupby`.
1833-
For example, we can use `.groupby` to group the regions of the `tidy_lang` data frame and then calculate the minimum and maximum number of Canadians
1834-
reporting the language as the primary language at home
1698+
as illustrated in {numref}`fig:summarize-groupby`.
1699+
For example, we can use `.groupby` to group the regions of the `tidy_lang` data frame and then calculate the minimum and maximum number of Canadians
1700+
reporting the language as the primary language at home
18351701
for each of the regions in the data set.
18361702

18371703
```{code-cell} ipython3
18381704
:tags: [remove-cell]
18391705
# A common pairing with `summarize` is `group_by`. Pairing these functions \index{group\_by}
# together can let you summarize values for subgroups within a data set,
# as illustrated in Figure \@ref(fig:summarize-groupby).
# For example, we can use `group_by` to group the regions of the `tidy_lang` data frame and then calculate the minimum and maximum number of Canadians
# reporting the language as the primary language at home
# for each of the regions in the data set.

# (ref:summarize-groupby) `summarize` and `group_by` is useful for calculating summary statistics on one or more column(s) for each group. It creates a new data frame&mdash;with one row for each group&mdash;containing the summary statistic(s) for each column being summarized. It also creates a column listing the value of the grouping variable. The darker, top row of each table represents the column headers. The gray, blue, and green colored rows correspond to the rows that belong to each of the three groups being represented in this cartoon example.
```

Notice that `.groupby` converts a `DataFrame` object to a `DataFrameGroupBy` object:
```{code-cell} ipython3
:tags: [remove-cell]

# Notice that `group_by` on its own doesn't change the way the data looks.
1892-
# In the output below, the grouped data set looks the same,
1893-
# and it doesn't *appear* to be grouped by `region`.
1894-
# Instead, `group_by` simply changes how other functions work with the data,
1895-
# as we saw with `summarize` above.
1757+
# Notice that `group_by` on its own doesn't change the way the data looks.
1758+
# In the output below, the grouped data set looks the same,
1759+
# and it doesn't *appear* to be grouped by `region`.
1760+
# Instead, `group_by` simply changes how other functions work with the data,
1761+
# as we saw with `summarize` above.
18961762
```
18971763

18981764
```{code-cell} ipython3
@@ -1905,23 +1771,23 @@ region_lang.groupby("region")
19051771

19061772
Sometimes we need to summarize statistics across many columns.
19071773
An example of this is illustrated in {numref}`fig:summarize-across`.
1908-
In such a case, using summary functions alone means that we have to
1774+
In such a case, using summary functions alone means that we have to
19091775
type out the name of each column we want to summarize.
1910-
In this section we will meet two strategies for performing this task.
1776+
In this section we will meet two strategies for performing this task.
19111777
First we will see how we can do this using `.iloc[]` to slice the columns before applying summary functions.
1912-
Then we will also explore how we can use a more general iteration function,
1778+
Then we will also explore how we can use a more general iteration function,
19131779
`.apply`, to also accomplish this.
19141780

19151781
```{code-cell} ipython3
19161782
:tags: [remove-cell]
19171783
19181784
# Sometimes we need to summarize statistics across many columns.
19191785
# An example of this is illustrated in Figure \@ref(fig:summarize-across).
1920-
# In such a case, using `summarize` alone means that we have to
1786+
# In such a case, using `summarize` alone means that we have to
19211787
# type out the name of each column we want to summarize.
1922-
# In this section we will meet two strategies for performing this task.
1788+
# In this section we will meet two strategies for performing this task.
19231789
# First we will see how we can do this using `summarize` + `across`.
1924-
# Then we will also explore how we can use a more general iteration function,
1790+
# Then we will also explore how we can use a more general iteration function,
19251791
# `map`, to also accomplish this.
19261792
```
19271793


```{index} column range
```

Recall from Section {ref}`loc-iloc` that we can use `.iloc[]` to extract a range of columns with indices. Here we demonstrate finding the maximum value
of each of the numeric
columns of the `region_lang` data set by pairing `.iloc[]` and `.max`. This means that the
summary methods (*e.g.* `.min`, `.max`, `.sum` etc.) can be used for data frames as well.
```{code-cell} ipython3
region_lang.iloc[:, 3:].max()
```

```{code-cell} ipython3
---
jupyter:
  source_hidden: true
tags: [remove-cell]
---
# To summarize statistics across many columns, we can use the
# `summarize` function we have just recently learned about.
# However, in such a case, using `summarize` alone means that we have to
# type out the name of each column we want to summarize.
# To do this more efficiently, we can pair `summarize` with `across` \index{across}
# and use a colon `:` to specify a range of columns we would like \index{column range}
# to perform the statistical summaries on.
# Here we demonstrate finding the maximum value
# of each of the numeric
# columns of the `region_lang` data set.

# ``` {r 02-across-data}
# region_lang |>
# summarize(across(mother_tongue:lang_known, max))
# ```

1978-
# > (e.g., `max`, `min`, `mean`, `sum`, etc) with `summarize` alone,
1979-
# > the use of the `summarize` + `across` functions paired
1843+
# > **Note:** Similar to when we use base R statistical summary functions
1844+
# > (e.g., `max`, `min`, `mean`, `sum`, etc) with `summarize` alone,
1845+
# > the use of the `summarize` + `across` functions paired
19801846
# > with base R statistical summary functions
1981-
# > also return `NA`s when we apply them to columns that
1847+
# > also return `NA`s when we apply them to columns that
19821848
# > contain `NA`s in the data frame. \index{missing data}
1983-
# >
1849+
# >
19841850
# > To avoid this, again we need to add the argument `na.rm = TRUE`,
19851851
# > but in this case we need to use it a little bit differently.
19861852
# > In this case, we need to add a `,` and then `na.rm = TRUE`,
1987-
# > after specifying the function we want `summarize` + `across` to apply,
1853+
# > after specifying the function we want `summarize` + `across` to apply,
19881854
# > as illustrated below:
1989-
# >
1855+
# >
19901856
# > ``` {r}
19911857
# > region_lang_na |>
19921858
# > summarize(across(mother_tongue:lang_known, max, na.rm = TRUE))
@@ -2005,9 +1871,9 @@ An alternative to aggregating on a dataframe
20051871
for applying a function to many columns is the `.apply` method.
20061872
Let's again find the maximum value of each column of the
20071873
`region_lang` data frame, but using `.apply` with the `max` function this time.
2008-
We focus on the two arguments of `.apply`:
1874+
We focus on the two arguments of `.apply`:
20091875
the function that you would like to apply to each column, and the `axis` along which the function will be applied (`0` for columns, `1` for rows).
2010-
Note that `.apply` does not have an argument
1876+
Note that `.apply` does not have an argument
20111877
to specify *which* columns to apply the function to.
20121878
Therefore, we will use the `.iloc[]` before calling `.apply`
20131879
to choose the columns for which we want the maximum.
@@ -2018,14 +1884,14 @@ jupyter:
20181884
source_hidden: true
20191885
tags: [remove-cell]
20201886
---
2021-
# An alternative to `summarize` and `across`
1887+
# An alternative to `summarize` and `across`
20221888
# for applying a function to many columns is the `map` family of functions. \index{map}
20231889
# Let's again find the maximum value of each column of the
20241890
# `region_lang` data frame, but using `map` with the `max` function this time.
2025-
# `map` takes two arguments:
2026-
# an object (a vector, data frame or list) that you want to apply the function to,
1891+
# `map` takes two arguments:
1892+
# an object (a vector, data frame or list) that you want to apply the function to,
20271893
# and the function that you would like to apply to each column.
2028-
# Note that `map` does not have an argument
1894+
# Note that `map` does not have an argument
20291895
# to specify *which* columns to apply the function to.
20301896
# Therefore, we will use the `select` function before calling `map`
20311897
# to choose the columns for which we want the maximum.
@@ -2038,15 +1904,15 @@ pd.DataFrame(region_lang.iloc[:, 3:].apply(max, axis=0)).T
20381904
```{index} missing data
20391905
```
20401906

2041-
> **Note:** Similar to when we use base Python statistical summary functions
2042-
> (e.g., `max`, `min`, `sum`, etc.) when there are `NaN`s,
1907+
> **Note:** Similar to when we use base Python statistical summary functions
1908+
> (e.g., `max`, `min`, `sum`, etc.) when there are `NaN`s,
20431909
> `.apply` functions paired with base Python statistical summary functions
2044-
> also return `NaN` values when we apply them to columns that
2045-
> contain `NaN` values.
2046-
>
1910+
> also return `NaN` values when we apply them to columns that
1911+
> contain `NaN` values.
1912+
>
20471913
> To avoid this, again we need to use the `pandas` variants of summary functions (*i.e.*
20481914
> `.max`, `.min`, `.sum`, etc.) with `skipna=True`.
2049-
> When we use this with `.apply`, we do this by constructing a anonymous function that calls
1915+
> When we use this with `.apply`, we do this by constructing a anonymous function that calls
20501916
> the `.max` method with `skipna=True`, as illustrated below:
20511917
20521918
```{code-cell} ipython3
@@ -2055,17 +1921,17 @@ pd.DataFrame(
20551921
).T
20561922
```
20571923

2058-
The `.apply` function is generally quite useful for solving many problems
2059-
involving repeatedly applying functions in Python.
2060-
Additionally, a variant of `.apply` is `.applymap`,
1924+
The `.apply` function is generally quite useful for solving many problems
1925+
involving repeatedly applying functions in Python.
1926+
Additionally, a variant of `.apply` is `.applymap`,
20611927
which can be used to apply functions element-wise.
20621928
To learn more about these functions, see the additional resources
20631929
section at the end of this chapter.
20641930

20651931
+++ {"jp-MarkdownHeadingCollapsed": true, "tags": ["remove-cell"]}
20661932

20671933
<!-- > **Note:** The `map` function comes from the `purrr` package. But since
2068-
> `purrr` is part of the tidyverse, once we call `library(tidyverse)` we
1934+
> `purrr` is part of the tidyverse, once we call `library(tidyverse)` we
20691935
> do not need to load the `purrr` package separately.
20701936
20711937
The output looks a bit weird... we passed in a data frame, but the output
@@ -2080,7 +1946,7 @@ region_lang |>
20801946
```
20811947
20821948
So what do we do? Should we convert this to a data frame? We could, but a
2083-
simpler alternative is to just use a different `map` function. There
1949+
simpler alternative is to just use a different `map` function. There
20841950
are quite a few to choose from, they all work similarly, but
20851951
their name reflects the type of output you want from the mapping operation.
20861952
Table \@ref(tab:map-table) lists the commonly used `map` functions as well
@@ -2107,24 +1973,24 @@ region_lang |>
21071973
map_dfr(max)
21081974
```
21091975
2110-
> **Note:** Similar to when we use base R statistical summary functions
2111-
> (e.g., `max`, `min`, `mean`, `sum`, etc.) with `summarize`,
1976+
> **Note:** Similar to when we use base R statistical summary functions
1977+
> (e.g., `max`, `min`, `mean`, `sum`, etc.) with `summarize`,
21121978
> `map` functions paired with base R statistical summary functions
2113-
> also return `NA` values when we apply them to columns that
1979+
> also return `NA` values when we apply them to columns that
21141980
> contain `NA` values. \index{missing data}
2115-
>
1981+
>
21161982
> To avoid this, again we need to add the argument `na.rm = TRUE`.
2117-
> When we use this with `map`, we do this by adding a `,`
1983+
> When we use this with `map`, we do this by adding a `,`
21181984
> and then `na.rm = TRUE` after specifying the function, as illustrated below:
2119-
>
1985+
>
21201986
> ``` {r}
21211987
> region_lang_na |>
21221988
> select(mother_tongue:lang_known) |>
21231989
> map_dfr(max, na.rm = TRUE)
21241990
> ```
21251991
2126-
The `map` functions are generally quite useful for solving many problems
2127-
involving repeatedly applying functions in R.
1992+
The `map` functions are generally quite useful for solving many problems
1993+
involving repeatedly applying functions in R.
21281994
Additionally, their use is not limited to columns of a data frame;
21291995
`map` family functions can be used to apply functions to elements of a vector,
21301996
or a list, and even to lists of (nested!) data frames.
@@ -2135,8 +2001,8 @@ section at the end of this chapter. -->
21352001

21362002
## Apply functions across many columns with `.apply`
21372003

Sometimes we need to apply a function to many columns in a data frame.
For example, we would need to do this when converting units of measurements across many columns.
21402006
We illustrate such a data transformation in {numref}`fig:mutate-across`.
21412007

21422008
+++ {"tags": []}
21502016

21512017
+++
21522018

For example,
imagine that we wanted to convert all the numeric columns
in the `region_lang` data frame from `int64` type to `int32` type
21562022
using the `.astype` method.
When we revisit the `region_lang` data frame,
21582024
we can see that this would be the columns from `mother_tongue` to `lang_known`.
21592025

21602026
```{code-cell} ipython3
---
jupyter:
21632029
source_hidden: true
21642030
tags: [remove-cell]
21652031
---
# For example,
# imagine that we wanted to convert all the numeric columns
# in the `region_lang` data frame from double type to integer type
# using the `as.integer` function.
# When we revisit the `region_lang` data frame,
# we can see that this would be the columns from `mother_tongue` to `lang_known`.
21722038
```
21732039

```{code-cell} ipython3
region_lang
21792045
```
21802046

21812047
To accomplish such a task, we can use `.apply`.
This works in a similar way for column selection
as we saw in Section {ref}`apply-summary` earlier.
As we did above,
we again use `.iloc` to specify the columns
as well as `.apply` to specify the function we want to apply to these columns.
However, a key difference is that here we are not using an aggregating function,
which means that we get back a data frame with the same number of rows.
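
As a sketch of this idea (the helper names here are ours, not necessarily
the book's exact code):

```{code-cell} ipython3
region_lang_int32 = region_lang.copy()
int_cols = ["mother_tongue", "most_at_home", "most_at_work", "lang_known"]
# full-column assignment replaces the columns, so the new int32 dtypes are kept
region_lang_int32[int_cols] = region_lang[int_cols].apply(lambda col: col.astype("int32"))
region_lang_int32
```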
21892055

21902056
```{code-cell} ipython3
---
jupyter:
  source_hidden: true
21942060
tags: [remove-cell]
21952061
---
21962062
# To accomplish such a task, we can use `mutate` paired with `across`. \index{across}
# This works in a similar way for column selection,
# as we saw when we used `summarize` + `across` earlier.
# As we did above,
# we again use `across` to specify the columns using `select` syntax
# as well as the function we want to apply on the specified columns.
# However, a key difference here is that we are using `mutate`,
# which means that we get back a data frame with the same number of rows.
22042070
```
22052071

22062072
```{code-cell} ipython3
region_lang.info()
22082074
```
22092075

22102076
```{code-cell} ipython3
region_lang_int32

22162082
```{code-cell} ipython3
2217-
region_lang_int32.dtypes
2083+
region_lang_int32.info()
22182084
```
22192085

22202086
We see that we get back a data frame
22212087
with the same number of columns and rows.
2222-
The only thing that changes is the transformation we applied
2088+
The only thing that changes is the transformation we applied
22232089
to the specified columns (here `mother_tongue` to `lang_known`).
22242090

22252091
+++
22262092

22272093
## Apply functions across columns within one row with `.apply`
22282094

What if you want to apply a function across columns but within one row?
22302096
We illustrate such a data transformation in {numref}`fig:rowwise`.
22312097

22322098
+++ {"tags": []}
22412107
+++
22422108

22432109
For instance, suppose we want to know the maximum value between `mother_tongue`,
2244-
`most_at_home`, `most_at_work`
2110+
`most_at_home`, `most_at_work`
22452111
and `lang_known` for each language and region
22462112
in the `region_lang` data set.
22472113
In other words, we want to apply the `max` function *row-wise.*
22482114
Before we use `.apply`, we will again use `.iloc` to select only the count columns
2249-
so we can see all the columns in the data frame's output easily in the book.
2115+
so we can see all the columns in the data frame's output easily in the book.
22502116
So for this demonstration, the data set we are operating on looks like this:
22512117

22522118
```{code-cell} ipython3
22562122
tags: [remove-cell]
22572123
---
22582124
# For instance, suppose we want to know the maximum value between `mother_tongue`,
2259-
# `most_at_home`, `most_at_work`
2125+
# `most_at_home`, `most_at_work`
22602126
# and `lang_known` for each language and region
22612127
# in the `region_lang` data set.
22622128
# In other words, we want to apply the `max` function *row-wise.*
2263-
# We will use the (aptly named) `rowwise` function in combination with `mutate`
2264-
# to accomplish this task.
2129+
# We will use the (aptly named) `rowwise` function in combination with `mutate`
2130+
# to accomplish this task.
22652131
22662132
# Before we apply `rowwise`, we will `select` only the count columns \index{rowwise}
2267-
# so we can see all the columns in the data frame's output easily in the book.
2133+
# so we can see all the columns in the data frame's output easily in the book.
22682134
# So for this demonstration, the data set we are operating on looks like this:
22692135
```
22702136

region_lang.iloc[:, 3:]
```

22742140

22752141
Now we use `.apply` with argument `axis=1`, to tell Python that we would like
22762142
the `max` function to be applied across, and within, a row,
2277-
as opposed to being applied on a column
2143+
as opposed to being applied on a column
22782144
(which is the default behavior of `.apply`):
22792145

22802146
```{code-cell} ipython3
---
jupyter:
  source_hidden: true
tags: [remove-cell]
22852151
---
22862152
# Now we apply `rowwise` before `mutate`, to tell R that we would like
22872153
# the mutate function to be applied across, and within, a row,
2288-
# as opposed to being applied on a column
2154+
# as opposed to being applied on a column
22892155
# (which is the default behavior of `mutate`):
22902156
```
22912157

```{code-cell} ipython3
# maximum of the count columns, computed within each row (reconstructed sketch)
region_lang_rowwise = region_lang.assign(
    maximum=region_lang.iloc[:, 3:].apply(max, axis=1)
)
region_lang_rowwise
22982164
```
22992165

We see that we get an additional column added to the data frame,
23012167
named `maximum`, which is the maximum value between `mother_tongue`,
23022168
`most_at_home`, `most_at_work` and `lang_known` for each language
23032169
and region.
23082174
source_hidden: true
23092175
tags: [remove-cell]
23102176
---
# Similar to `group_by`,
# `rowwise` doesn't appear to do anything when it is called by itself.
# However, we can apply `rowwise` in combination
# with other functions to change how these other functions operate on the data.
# Notice if we used `mutate` without `rowwise`,
# we would have computed the maximum value across *all* rows
# rather than the maximum value for *each* row.
# Below we show what would have happened had we not used
# `rowwise`. In particular, the same maximum value is reported
# in every single row; this code does not provide the desired result.

# ```{r}
# region_lang |>
# select(mother_tongue:lang_known) |>
# mutate(maximum = max(c(mother_tongue,
# most_at_home,
# most_at_home,
# lang_known)))
# ```
```

23322198
## Summary

Cleaning and wrangling data can be a very time-consuming process. However,
it is a critical step in any data analysis. We have explored many different
functions for cleaning and wrangling data into a tidy format.
{numref}`tab:summary-functions-table` summarizes some of the key wrangling
functions we learned in this chapter. In the following chapters, you will
learn how you can take this tidy data and do so much more with it to answer your
burning data science questions!

+++

```{table} Summary of wrangling functions
:name: tab:summary-functions-table

| Function | Description |
| --- | ----------- |
| `.agg` | calculates aggregated summaries of inputs |
| `.apply` | allows you to apply function(s) to multiple columns/rows |
| `.assign` | adds or modifies columns in a data frame |
| `.groupby` | allows you to apply function(s) to groups of rows |
| `.iloc` | subsets columns/rows of a data frame using integer indices |
| `.loc` | subsets columns/rows of a data frame using labels |
| `.melt` | generally makes the data frame longer and narrower |
| `.pivot` | generally makes a data frame wider and decreases the number of rows |
| `.str.split` | splits up a string column into multiple columns |
```

```{code-cell} ipython3
---
jupyter:
  source_hidden: true
tags: [remove-cell]
---
# ## Summary

# Cleaning and wrangling data can be a very time-consuming process. However,
# it is a critical step in any data analysis. We have explored many different
# functions for cleaning and wrangling data into a tidy format.
# Table \@ref(tab:summary-functions-table) summarizes some of the key wrangling
# functions we learned in this chapter. In the following chapters, you will
# learn how you can take this tidy data and do so much more with it to answer your
# burning data science questions!

# \newpage

# Table: (#tab:summary-functions-table) Summary of wrangling functions

# | Function | Description |
# | --- | ----------- |
# | `across` | allows you to apply function(s) to multiple columns |
# | `filter` | subsets rows of a data frame |
# | `group_by` | allows you to apply function(s) to groups of rows |
# | `mutate` | adds or modifies columns in a data frame |
# | `map` | general iteration function |
# | `pivot_longer` | generally makes the data frame longer and narrower |
# | `pivot_wider` | generally makes a data frame wider and decreases the number of rows |
# | `rowwise` | applies functions across columns within one row |
# | `separate` | splits up a character column into multiple columns |
# | `select` | subsets columns of a data frame |
# | `summarize` | calculates summaries of inputs |
```
23942260

23952261
## Exercises

Practice exercises for the material covered in this chapter
can be found in the accompanying
[worksheets repository](https://github.com/UBC-DSCI/data-science-a-first-intro-worksheets#readme)
in the "Cleaning and wrangling data" row.
You can launch an interactive version of the worksheet in your browser by clicking the "launch binder" button.
and guidance that the worksheets provide will function as intended.

+++ {"tags": []}

## Additional resources

- The [`pandas` package documentation](https://pandas.pydata.org/docs/reference/index.html) is
another resource to learn more about the functions in this
```{code-cell} ipython3
---
jupyter:
  source_hidden: true
tags: [remove-cell]
---
# ## Additional resources

# - As we mentioned earlier, `tidyverse` is actually an *R
# meta package*: it installs and loads a collection of R packages that all
# follow the tidy data philosophy we discussed above. One of the `tidyverse`
# packages is `dplyr`&mdash;a data wrangling workhorse. You have already met many
# of `dplyr`'s functions
# (`select`, `filter`, `mutate`, `arrange`, `summarize`, and `group_by`).
# To learn more about these functions and meet a few more useful
# functions, we recommend you check out Chapters 5-9 of the [STAT545 online notes](https://stat545.com/).
# of the data wrangling, exploration, and analysis with R book.
24502316
# The site also provides a very nice cheat sheet that summarizes many of the
# data wrangling functions from this chapter.
# - Check out the [`tidyselect` R package page](https://tidyselect.r-lib.org/index.html)
# [@tidyselect] for a comprehensive list of `select` helpers.
# These helpers can be used to choose columns in a data frame when paired with the `select` function
# (and other functions that use the `tidyselect` syntax, such as `pivot_longer`).
# The [documentation for `select` helpers](https://tidyselect.r-lib.org/reference/select_helpers.html)
# is a useful reference to find the helper you need for your particular problem.
# - *R for Data Science* [@wickham2016r] has a few chapters related to
# data wrangling that go into more depth than this book. For example, the
```


```{bibliography}
:filter: docname in docnames
```
