Ch3 fig cleanup #103

Merged
merged 2 commits into from
Jan 6, 2023
Binary file modified source/img/code-figures.pptx
Binary file not shown.
Binary file modified source/img/wrangling/pandas_melt_args_labels.png
Binary file modified source/img/wrangling/pandas_melt_wide-long.png
Binary file modified source/img/wrangling/pandas_pivot_args_labels.png
Binary file modified source/img/wrangling/pandas_pivot_long-wide.png
62 changes: 27 additions & 35 deletions source/wrangling.md
@@ -138,8 +138,6 @@ region = pd.Series(["Toronto", "Montreal", "Vancouver", "Calgary", "Ottawa"])
region
```

<!-- **(FIGURE 14 NEEDS UPDATING: (a) ZERO-BASED INDEXING, (b) TYPE SHOULD BE STRING (NOT CHARACTER))** -->

+++ {"tags": []}

```{figure} img/wrangling/pandas_dataframe_series.png
@@ -194,7 +192,7 @@ It is important in Python to make sure you represent your data with the correct
Many of the `pandas` functions we use in this book treat
the various data types differently. You should use `int` and `float` types
to represent numbers and perform arithmetic. The `int` type is for integers that have no decimal point,
while the `float` type is for numbers that have a decimal point.
The `bool` type is for boolean variables, which can take only one of two values: `True` or `False`.
The `string` type is used to represent data that should
be thought of as "text", such as words, names, paths, URLs, and more.
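As a minimal sketch of the types described above, the snippet below checks the type of one value of each kind. The variable names and values are hypothetical and chosen only for illustration; note that plain Python names its text type `str`.

```python
# Hypothetical values, used only to illustrate the four basic types
count = 5             # int: a whole number with no decimal point
ratio = 2.5           # float: a number with a decimal point
is_official = True    # bool: either True or False
city = "Toronto"      # str: text such as words, names, paths, or URLs

# Collect the name of each value's type
basic_types = [type(value).__name__ for value in (count, ratio, is_official, city)]
print(basic_types)

# Arithmetic between an int and a float produces a float
total = count + ratio
print(type(total).__name__)
```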
@@ -480,8 +478,6 @@ in the `melt` function to accomplish this data transformation.

+++ {"tags": []}

**(FIGURE UPDATE NEEDED TO MATCH THE CODE BELOW)**

```{figure} img/wrangling/pandas_melt_args_labels.png
:name: fig:img-pivot-longer
:figclass: figure
@@ -986,7 +982,7 @@ with higher numbers of people who speak it as their primary language at home
compared to French in Montréal, then we can use `[]` to obtain rows
where the value of `most_at_home` is greater than
{glue:text}`most_french`. We use the `>` symbol to look for values *above* a threshold,
and the `<` symbol to look for values *below* a threshold. The `>=` and `<=`
symbols similarly look for *equal to or above* a threshold and *equal to or below* a threshold.
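To make the four comparison symbols concrete, here is a small sketch on a made-up data frame. The `langs` data frame, its counts, and the `threshold` value are assumptions for illustration and are not the census data used in this chapter.

```python
import pandas as pd

# Hypothetical counts, not the census data
langs = pd.DataFrame({
    "language": ["A", "B", "C", "D"],
    "most_at_home": [100, 250, 250, 400],
})
threshold = 250

# Each comparison builds a True/False column that [] uses to keep rows
above = langs[langs["most_at_home"] > threshold]
at_or_above = langs[langs["most_at_home"] >= threshold]
below = langs[langs["most_at_home"] < threshold]
at_or_below = langs[langs["most_at_home"] <= threshold]

print(above)
print(at_or_above)
```

Rows whose value equals the threshold appear only in the `>=` and `<=` results.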

```{code-cell} ipython3
@@ -1448,28 +1444,28 @@ so that we can convert them from `int64` to `int32`. We will use what is called
a `lambda` function in Python; `lambda` functions are just regular functions,
except that you don't need to give them a name.
That means you can pass them as an argument into `apply` easily!
Let's consider a simple example of a `lambda` function that
multiplies a number by two.
```{code-cell} ipython3
lambda x: 2*x
```
We define a `lambda` function in the following way. We start with the syntax `lambda`, which is a special word
that tells Python "what follows is
a function." Following this, we then state the names of the arguments of the function.
In this case, we just have one argument named `x`. After the list of arguments, we put a
colon `:`. And finally after the colon are the instructions: take the value provided and multiply it by 2.
Let's call our shiny new `lambda` function with the argument `2` (so the output should be `4`).
Just like a regular function, we pass its argument between parentheses `()` symbols.
```{code-cell} ipython3
(lambda x: 2*x)(2)
```
> **Note:** Because we didn't give the `lambda` function a name, we have to surround it with
> parentheses too if we want to call it. Otherwise, if we wrote something like `lambda x: 2*x(2)`, Python would get confused
> and think that `(2)` was part of the instructions that comprise the `lambda` function.
> As long as we don't want to call the `lambda` function ourselves, we don't need those parentheses. For example,
> we can pass a `lambda` function as an argument to `apply` without any parentheses.
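
The sketch below illustrates the points in the note using plain Python only. The names `double`, `double_named`, and `unparenthesized` are hypothetical and exist only for this illustration.

```python
# A lambda behaves like a regular function; here we give it a name only to compare
double = lambda x: 2 * x

def double_named(x):
    return 2 * x

# Calling an unnamed lambda requires wrapping it in parentheses first
called_directly = (lambda x: 2 * x)(2)

results = [double(2), double_named(2), called_directly]
print(results)

# Without the outer parentheses, `(2)` becomes part of the function body,
# so the body tries to call its argument x as if x were a function
unparenthesized = lambda x: 2 * x(2)
try:
    unparenthesized(3)
except TypeError as err:
    print("TypeError:", err)
```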

Returning to our example, let's use `apply` to convert the columns `"mother_tongue":"lang_known"`
to `int32`. To accomplish this we create a `lambda` function that takes one argument---a single column
of the data frame, which we will name `col`---and apply the `astype` method to it.
Then the `apply` method will use that `lambda` function on every column we specify via `loc[]`.
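As a rough sketch of this pattern, the snippet below applies a `lambda` to a slice of columns selected with `loc[]`. The data frame `df` and its counts are made up for illustration; only the column names mirror the ones in the text.

```python
import pandas as pd

# Hypothetical counts; only the column names mirror the text
df = pd.DataFrame({
    "language": ["A", "B"],
    "mother_tongue": [10, 20],
    "most_at_home": [5, 15],
    "most_at_work": [1, 2],
    "lang_known": [30, 40],
})

# Select the column range with loc[], then convert each column to int32
converted = df.loc[:, "mother_tongue":"lang_known"].apply(
    lambda col: col.astype("int32")
)
print(converted.dtypes)
```

Because `apply` works column by column here, the `lambda` receives one whole column at a time as `col`.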
@@ -1514,8 +1510,8 @@ region_lang_nums.apply(max, axis=1)

We see that we get a column containing the maximum value among `mother_tongue`,
`most_at_home`, `most_at_work` and `lang_known` for each language
and region. It is often the case that we want to include a column result
from using `apply` row-wise as a new column in the data frame, so that we can make
plots or continue our analysis. To make this happen,
we will use `assign` to create a new column. This is discussed in the next section.
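The following sketch shows a row-wise `apply` and how its result can become a new column. The `nums` data frame and its values are hypothetical; only the column names follow the text.

```python
import pandas as pd

# Hypothetical counts; only the column names mirror the text
nums = pd.DataFrame({
    "mother_tongue": [10, 20, 3],
    "most_at_home": [5, 25, 2],
    "most_at_work": [1, 2, 9],
    "lang_known": [30, 22, 4],
})

# axis=1 applies max across each row; the result has one value per row
row_max = nums.apply(max, axis=1)
print(row_max)

# assign returns a new data frame with the extra column; nums is unchanged
with_max = nums.assign(maximum=row_max)
print(with_max)
```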

@@ -1540,7 +1536,7 @@ with the new column added to it.

To use the `assign` method, we specify one argument for each column we want to create.
In this case we want to create one new column named `maximum`, so the argument
to `assign` begins with `maximum = `.
Then after the `=`, we specify what the contents of that new column
should be. In this case we use `apply` just as we did in the previous section to give us the maximum values.
Remember to specify `axis=1` in the `apply` method so that we compute the row-wise maximum value.
@@ -1550,7 +1546,7 @@ region_lang.assign(
maximum = region_lang_nums.apply(max, axis=1)
)
```
This gives us a new data frame that looks like the `region_lang` data frame,
except that it has an additional column named `maximum`.
The `maximum` column contains
the maximum value between `mother_tongue`,
@@ -1575,7 +1571,7 @@ glue("toronto_popn", "{0:,.0f}".format(toronto_popn))
glue("prop_eng_tor", "{0:.2f}".format(number_most_home / toronto_popn))
```

As another example, we might ask the question: "What proportion of
the population reported English as their primary language at home in the 2016 census?"
For example, in Toronto, {glue:text}`number_most_home` people reported
speaking English as their primary language at home, and the
@@ -1597,43 +1593,39 @@ and name the new data frame `english_langs`.
```{code-cell} ipython3
:tags: ["output_scroll"]
english_lang = region_lang[
    (region_lang["language"] == "English") &
    (region_lang["region"].isin(five_cities["region"]))
]
english_lang
```

Okay, now we have a data frame that pertains only to the English language
and the five cities mentioned earlier.
In order to compute the proportion of the population speaking English in each of these cities,
we need to add the population data from the `five_cities` data frame.
```{code-cell} ipython3
five_cities
```
The data frame above shows that the populations of the five cities in 2016 were
5928040 (Toronto), 4098927 (Montréal), 2463431 (Vancouver), 1392609 (Calgary), and 1321426 (Edmonton).
We will add this information to our data frame in a new column named `city_pops` by using `assign`.
Once again we specify the new column name (`city_pops`) as the argument, followed by the equal symbol `=`,
and finally the data in the column.
Note that the order of the rows in the `english_lang` data frame is Montréal, Toronto, Calgary, Edmonton, Vancouver.
So we will create a column called `city_pops` where we list the populations of those cities in that
order, and add it to our data frame.
Also note that we write `english_lang = ` on the left so that the newly created data frame overwrites our
old `english_lang` data frame; remember that by default, like other `pandas` functions, `assign` does not
modify the original data frame directly!
```{code-cell} ipython3
:tags: ["output_scroll"]
english_lang = english_lang.assign(
    city_pops=[4098927, 5928040, 1392609, 1321426, 2463431]
)
english_lang
```
> **Note**: Inserting data manually in this way is generally very error-prone and is not recommended.
> We do it here to demonstrate another usage of `assign` that does not involve `apply`.
> But in more advanced data wrangling,
> one would solve this problem in a less error-prone way using
> the `merge` function, which lets you combine two data frames. We will show you an
@@ -1645,8 +1637,8 @@ proportion of people who speak English the most at home by taking the ratio of t
```{code-cell} ipython3
:tags: ["output_scroll"]
english_lang.assign(
    proportion=english_lang["most_at_home"] / english_lang["city_pops"]
)
```
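To check the behaviour of `assign` on a smaller scale, here is a sketch with made-up numbers. The `cities` data frame, its region labels, and its counts are hypothetical; the column names follow the ones used above.

```python
import pandas as pd

# Hypothetical regions and counts, chosen so the ratios are easy to verify
cities = pd.DataFrame({
    "region": ["X", "Y"],
    "most_at_home": [50, 30],
})

# assign with a list adds a column in row order and returns a new data frame
with_pops = cities.assign(city_pops=[200, 120])

# assign with a computed column: element-wise ratio of two existing columns
with_prop = with_pops.assign(
    proportion=with_pops["most_at_home"] / with_pops["city_pops"]
)
print(with_prop)

# The original data frame is left untouched
print(list(cities.columns))
```

Since `assign` never modifies its input, each result has to be stored in a variable if we want to keep it.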


@@ -1737,7 +1729,7 @@ right order, and it could be easy to make a mistake this way. An alternative app
is to (1) create a new, empty data frame, (2) use `assign` to assign the city names and populations in that
data frame, and (3) use `merge` to combine the two data frames, recognizing that the "regions" are the same.

We create a new, empty data frame by calling `pd.DataFrame` with no arguments.
We then use `assign` to add the city names in a column called `"region"`
and their populations in a column called `"population"`.
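The three steps can be sketched on a reduced example with two cities. The `english` data frame and its `most_at_home` counts are hypothetical; the two population figures are the ones quoted earlier for Toronto and Montréal. This sketch uses the `merge` method of a data frame, which is one way to perform the merge; `pd.merge` is another.

```python
import pandas as pd

# Hypothetical language counts; only the structure mirrors the text
english = pd.DataFrame({
    "region": ["Montréal", "Toronto"],
    "most_at_home": [10, 20],
})

# Steps (1) and (2): start from an empty data frame and assign the columns
city_data = pd.DataFrame().assign(
    region=["Toronto", "Montréal"],
    population=[5928040, 4098927],
)

# Step (3): merge on the shared "region" column; row order no longer matters
merged = english.merge(city_data, on="region")
print(merged)
```

The cities are listed in a different order in the two data frames, yet each population still lands on the matching row because `merge` aligns rows by the values in `"region"`.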
```{code-cell} ipython3