Ch3 suggestions #97

Merged
merged 8 commits into from
Jul 29, 2023
205 changes: 119 additions & 86 deletions source/wrangling.md
@@ -1014,58 +1014,71 @@ is less often used than the earlier approaches we introduced, but it can come in
to make long chains of filtering operations a bit easier to read.

(loc-iloc)=
## Using `loc[]` to filter rows and select columns.
## Using `loc[]` to filter rows and select columns

```{index} pandas.DataFrame; loc[]
```

The `[]` operation is only used when you want to filter rows or select columns;
The `[]` operation is only used when you want to either filter rows **or** select columns;
Contributor @trevorcampbell, Jul 11, 2023:

bit unclear given the below section on using : (where you need to use .loc[] even if you're only doing a range on one var)

Contributor @trevorcampbell, Jul 11, 2023:

ugh... also complicated by the fact that you can use [] for str.startswith etc.

Contributor Author:

When we use loc with a range we are technically filtering rows too with the :, but I see what you mean: the intention of that operation is just to select a range of columns, and the row filtering is a syntactical detail. We could rewrite this to "The [] operation is only used when you want to either filter rows [using a boolean expression] or select [a list of columns]", but I am not sure if that is too specific.

> also complicated by the fact that you can use [] for str.startswith etc

That was a mistake on my part, you can't do that unless you add an extra step of filtering the column names using the boolean array returned from str.startswith so I updated that section to use loc instead.

it cannot be used to do both operations at the same time. This is where `loc[]`
comes in. For the first example, recall `loc[]` from Chapter {ref}`intro`,
which lets us create a subset of columns from a data frame.

```{code-cell} ipython3
:tags: ["output_scroll"]
tidy_lang.loc[
tidy_lang['region'] == 'Toronto',
["language", "region", "most_at_home", "most_at_work"]
]
```

### Using `loc[]` to select ranges of columns

Contributor @trevorcampbell, Jul 11, 2023:

the connecting thread in this section doesn't seem to be the use of loc[], it seems to be about selecting ranges

at a high level, many of the edits in this section make it flow poorly -- lots of jumping between `[]` and `.loc[]` when the section is supposed to be about "using `.loc[]` to do X, when/why to use `.loc[]` instead of `[]`"

I think we need to more carefully consider how we want to organize this stuff

Contributor:

e.g. I'm OK with demonstrating various uses of .loc[] that flow better with the text pedagogically even if technically one could use [] for the same purpose.

Contributor Author:

The only jump to [] that I see is when we use startswith, and that was a mistake on my part because we should be using loc there. I updated this section; do you think it flows better now?

Suppose we wanted to select only the columns `language`, `region`,
`most_at_home` and `most_at_work` from the `tidy_lang` data set. Using what we
learned in the chapter on {ref}`intro`, we would pass all of these column names into the square brackets.

```{code-cell} ipython3
:tags: ["output_scroll"]
selected_columns = tidy_lang.loc[:, ["language", "region", "most_at_home", "most_at_work"]]
selected_columns
tidy_lang[["language", "region", "most_at_home", "most_at_work"]]
```
We pass `:` before the comma indicating we want to retrieve all rows, and the list indicates
the columns that we want.

Note that we could obtain the same result by stating that we would like all of the columns
from `language` through `most_at_work`. Instead of passing a list of all of the column
names that we want, we can ask for the range of columns `"language":"most_at_work"`, which
you can read as "The columns from `language` to `most_at_work`".
This `:`-syntax is supported by the `loc` function,
but not by the `[]`, so we need to switch to using `loc[]` here.

```{code-cell} ipython3
:tags: ["output_scroll"]
selected_columns = tidy_lang.loc[:, "language":"most_at_work"]
selected_columns
tidy_lang.loc[:, "language":"most_at_work"]
```

We pass `:` before the comma indicating we want to retrieve all rows,
i.e. we are not filtering any rows in this expression.
Similarly, you can ask for all of the columns including and after `language` by doing the following

```{code-cell} ipython3
:tags: ["output_scroll"]
selected_columns = tidy_lang.loc[:, "language":]
selected_columns
tidy_lang.loc[:, "language":]
```

By not putting anything after the `:`, python reads this as "from `language` until the last column".
Although the notation for selecting a range using `:` is convienent because less code is required,
By not putting anything after the `:`, Python reads this as "from `language` until the last column".
Although the notation for selecting a range using `:` is convenient because less code is required,
it must be used carefully. If you were to re-order columns or add a column to the data frame, the
output would change. Using a list is more explicit and less prone to potential confusion.

Suppose instead we wanted to extract columns that followed a particular pattern
rather than just selecting a range. For example, let's say we wanted only to select the
columns `most_at_home` and `most_at_work`. There are other functions that allow
us to select variables based on their names. In particular, we can use the `.str.startswith` method
to choose only the columns that start with the word "most":
to choose only the columns that start with the word "most".
Since the `str.starswith` expression returns a list of column names
Contributor:

typo "starswith"

we can use either `[]` or `loc[]` here.

```{code-cell} ipython3
tidy_lang.loc[:, tidy_lang.columns.str.startswith('most')]
tidy_lang[tidy_lang.columns.str.startswith('most')]
```
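
A minimal sketch of the point raised in the review thread above, assuming a small made-up data frame (the values are hypothetical and are not the `tidy_lang` data set). `columns.str.startswith` returns a boolean array with one entry per column, so it can go straight into the column position of `loc[]`, whereas plain `[]` treats a boolean array as a row filter and needs the extra step of turning the array into column names first.

```python
import pandas as pd

# Hypothetical toy data with the same column names as the text.
df = pd.DataFrame({
    "language": ["English", "French"],
    "region": ["Toronto", "Toronto"],
    "most_at_home": [10, 20],
    "most_at_work": [5, 15],
})

# One boolean per column: False, False, True, True
mask = df.columns.str.startswith("most")

# loc[] accepts the boolean array directly in the column position.
via_loc = df.loc[:, mask]

# Plain [] needs column labels, so filter the labels with the mask first.
via_brackets = df[df.columns[mask]]

# Passing the mask straight to [] is read as a row filter; here the
# lengths differ (4 columns, 2 rows), so pandas raises a ValueError.
try:
    df[mask]
    direct_mask_failed = False
except ValueError:
    direct_mask_failed = True
```

Both `via_loc` and `via_brackets` hold only the two `most_*` columns; the `loc[]` form avoids the intermediate step.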

```{index} pandas.Series; str.contains
@@ -1076,46 +1089,43 @@ We could also have chosen the columns containing an underscore `_` by using the
the columns we want contain underscores and the others don't.

```{code-cell} ipython3
tidy_lang.loc[:, tidy_lang.columns.str.contains('_')]
tidy_lang[tidy_lang.columns.str.contains('_')]
```

There are many different functions that help with selecting
variables based on certain criteria.
The additional resources section at the end of this chapter
provides a comprehensive resource on these functions.

```{code-cell} ipython3
:tags: [remove-cell]

# There are many different `select` helpers that select
# variables based on certain criteria.
# The additional resources section at the end of this chapter
# provides a comprehensive resource on `select` helpers.
```

## Using `iloc[]` to extract a range of columns
## Using `iloc[]` to extract rows and columns by position
```{index} pandas.DataFrame; iloc[], column range
```
Another approach for selecting columns is to use `iloc[]`,
which provides the ability to index with integers rather than the names of the columns.
For example, the column names of the `tidy_lang` data frame are
Another approach for selecting rows and columns is to use `iloc[]`,
which provides the ability to index with the position rather than the label of the columns.
For example, the column labels of the `tidy_lang` data frame are
`['category', 'language', 'region', 'most_at_home', 'most_at_work']`.
Using `iloc[]`, you can ask for the `language` column by requesting the
column at index `1` (remember that Python starts counting at `0`, so the second item `'language'`
has index `1`!).

```{code-cell} ipython3
column = tidy_lang.iloc[:, 1]
column
tidy_lang.iloc[:, 1]
```

You can also ask for multiple columns, just like we did with `[]`. We pass `:` before
the comma, indicating we want to retrieve all rows, and `1:` after the comma
You can also ask for multiple columns.
We pass `1:` after the comma,
indicating we want columns after and including index 1 (*i.e.* `language`).

```{code-cell} ipython3
column_range = tidy_lang.iloc[:, 1:]
column_range
tidy_lang.iloc[:, 1:]
```

We can also use `iloc[]` to select ranges of rows, using a similar syntax.
Contributor Author:

I added this to show that rows can be subset too

Contributor:

seems like something we would want to do in the .loc[] section above too (assuming we didn't already)

and for [] by itself too somewhere, unless we've already done that elsewhere

Contributor Author:

[] by itself comes just after iloc below. If I remember correctly, I opted to not include row ranges with loc because then we need to get into explaining named indexes. For example .loc[:5] is only the same as [:5] and .iloc[:5] if the index names of the first five rows are 0,1,2,3,4, since loc always selects by row/index name/label.

For example, to select the first ten rows we could use the following:

```{code-cell} ipython3
tidy_lang.iloc[:10, :]
```

`pandas` also provides a shorthand for selecting ranges of rows by using `[]`:

```{code-cell} ipython3
tidy_lang[:10]
```
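
The review note above about why row ranges with `loc[]` were left out can be illustrated with a short sketch. It assumes a made-up data frame (hypothetical values, not `tidy_lang`): `iloc[]` and the `[]` shorthand slice rows by position and exclude the end point, while `loc[]` slices by row label and includes the end label.

```python
import pandas as pd

# Hypothetical toy data with the default row labels 0, 1, 2, 3.
df = pd.DataFrame({"language": ["English", "French", "Cree", "Tagalog"]})

by_position = df.iloc[:2]   # positions 0 and 1
by_shorthand = df[:2]       # same rows as iloc[:2]
by_label = df.loc[:2]       # labels 0, 1 and 2 (end label included)

# With non-default row labels, loc[] slices on those labels instead.
relabelled = df.set_axis(["a", "b", "c", "d"], axis=0)
by_label_str = relabelled.loc[:"b"]   # labels "a" and "b"
```

Here `by_position` and `by_shorthand` have two rows each, while `by_label` has three, which is the kind of difference that would need an explanation of row labels.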

The `iloc[]` method is less commonly used, and needs to be used with care.
@@ -1251,52 +1261,44 @@ summary statistics that you can compute with `pandas`.
What if you want to calculate summary statistics on an entire data frame? Well,
it turns out that the functions in {numref}`tab:basic-summary-statistics`
can be applied to a whole data frame!
For example, we can ask for the number of rows that each column has using `count`.
```{code-cell} ipython3
region_lang.count()
Comment on lines -1254 to -1256
Contributor Author:

I don't think count is a good example, since a data frame must have the same number of rows in each column, and we should get that info from shape or info instead. I think this flows better from the previous section too.

```
Not surprisingly, they are all the same. We could also ask for the `mean`, but
some of the columns in `region_lang` contain string data with words like `"Vancouver"`
and `"Halifax"`---for these columns there is no way for `pandas` to compute the mean.
So we provide the keyword `numeric_only=True` so that it only computes the mean of columns with numeric values. This
is also needed if you want the `sum` or `std`.
```{code-cell} ipython3
region_lang.mean(numeric_only=True)
```
If we ask for the `min` or the `max`, `pandas` will give you the smallest or largest number
for columns with numeric values. For columns with text, it will return the
least repeated value for `min` and the most repeated value for `max`. Again,
Comment on lines -1267 to -1268
Contributor Author:

I believe that this is incorrect; it is alphabetical.

if you only want the minimum and maximum value for
numeric columns, you can provide `numeric_only=True`.
For example, we can ask for the maximum value of each column using `max`.

```{code-cell} ipython3
region_lang.max()
```

We can see that for columns that contain string data
with words like `"Vancouver"` and `"Halifax"`,
the maximum value is determined by sorting the string alphabetically
and returning the last value.
If we only want the maximum value for
numeric columns,
we can provide `numeric_only=True`:

```{code-cell} ipython3
region_lang.min()
region_lang.max(numeric_only=True)
```
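
As a quick check of the alphabetical-ordering claim, here is a minimal sketch on a made-up data frame (hypothetical values, not `region_lang`). It shows `max` returning the alphabetically last string for a text column, and `numeric_only=True` restricting both `max` and `mean` to the numeric column.

```python
import pandas as pd

# Hypothetical toy data: one string column and one numeric column.
df = pd.DataFrame({
    "region": ["Vancouver", "Halifax", "Toronto"],
    "most_at_home": [30, 10, 20],
})

# For the string column, max is the last value in alphabetical order.
all_max = df.max()

# numeric_only=True drops the string column from the result.
numeric_max = df.max(numeric_only=True)
numeric_mean = df.mean(numeric_only=True)
```

In this sketch `all_max["region"]` is `"Vancouver"` because it sorts last, not because it is repeated most often.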

Similarly, if there are only some columns for which you would like to get summary statistics,
you can first use `loc[]` and then ask for the summary statistic. An example of this is illustrated in {numref}`fig:summarize-across`.
Later, we will talk about how you can also use a more general function, `apply`, to accomplish this.
We could also ask for the `mean` of each column in the data frame.
It does not make sense to compute the mean of the string columns,
so in this case we *must* provide the keyword `numeric_only=True`
so that the mean is only computed on columns with numeric values.

```{figure} img/summarize/summarize.003.jpeg
:name: fig:summarize-across
:figclass: figure

`loc[]` or `apply` is useful for efficiently calculating summary statistics on
many columns at once. The darker, top row of each table represents the column
headers.
Comment on lines -1282 to -1288
Contributor Author:

I don't understand what this is meant to say. loc is already explained in the text, and apply is not efficient; it is less efficient than what we use here. The figure also does not seem to add anything, and I don't understand what it is showing myself.

```{code-cell} ipython3
region_lang.mean(numeric_only=True)
```

If there are only some columns for which you would like to get summary statistics,
you can first use `[]` to select those columns
and then ask for the summary statistic,
as we did for a single column previously:
Let's say that we want to know
the mean and standard deviation of all of the columns between `"mother_tongue"` and `"lang_known"`.
We use `loc[]` to specify the columns and then `agg` to ask for both the `mean` and `std`.
We use `[]` to specify the columns and then `agg` to ask for both the `mean` and `std`.
```{code-cell} ipython3
region_lang.loc[:, "mother_tongue":"lang_known"].agg(["mean", "std"])
region_lang["mother_tongue":"lang_known"].agg(["mean", "std"])
Contributor:

wait, I thought : wasn't allowed in [] as per text above

Contributor Author:

Good catch, this is a typo and should be region_lang.loc[:, "most_at_home":"most_at_work"] as in the next section.

```
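
The catch in the thread above can be reproduced with a small sketch, assuming a made-up data frame (hypothetical values, not `region_lang`). A `:` range inside plain `[]` is interpreted as a row slice, so it does not select a range of columns; `loc[:, start:stop]` does.

```python
import pandas as pd

# Hypothetical toy data with a few of the column names used in the text.
df = pd.DataFrame({
    "region": ["Toronto", "Halifax"],
    "mother_tongue": [100, 40],
    "most_at_home": [80, 30],
    "lang_known": [120, 50],
})
wanted = ["mother_tongue", "most_at_home", "lang_known"]

# loc[] supports label ranges in the column position (end label included).
via_loc = df.loc[:, "mother_tongue":"lang_known"]

# Plain [] reads a slice as a ROW slice. With the default integer row
# labels, slicing with strings is expected to raise rather than pick columns.
try:
    via_brackets = df["mother_tongue":"lang_known"]
    brackets_select_columns = list(via_brackets.columns) == wanted
except (TypeError, KeyError):
    brackets_select_columns = False

# The summary from the text, computed on the loc[] selection.
summary = via_loc.agg(["mean", "std"])
```

The exact exception type may vary across pandas versions, which is why the sketch only records that the `[]` form did not produce the column range.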



## Performing operations on groups of rows using `groupby`

+++
@@ -1334,56 +1336,87 @@ The `groupby` function takes at least one argument---the columns to use in the
grouping. Here we use only one column for grouping (`region`).

```{code-cell} ipython3
region_lang.groupby("region")["most_at_home"].agg(["min", "max"])
Contributor Author:

I think this order of explanation is more helpful, introducing one step at a time

region_lang.groupby("region")
```

Notice that `groupby` converts a `DataFrame` object to a `DataFrameGroupBy`
object, which contains information about the groups of the data frame. We can
then apply aggregating functions to the `DataFrameGroupBy` object. This can be handy if you would like to perform multiple operations and assign
each output to its own object.

```{code-cell} ipython3
region_lang.groupby("region")
region_lang.groupby("region")["most_at_home"].agg(["min", "max"])
```

The resulting dataframe has `region` as an index name.
This is similar to what happened when we reshaped data frames in the previous chapter,
and just as we did then,
you can use `reset_index` to get back to a regular dataframe
with `region` as a column name.

```{code-cell} ipython3
region_lang.groupby("region")["most_at_home"].agg(["min", "max"]).reset_index()
```
Comment on lines +1378 to +1380
Contributor Author:

I think this is quite important to mention here too as we don't cover named indices or multi-indices

You can also pass multiple column names to `groupby`. For example, if we wanted to
know about how the different categories of languages (Aboriginal, Non-Official &
Non-Aboriginal, and Official) are spoken at home in different regions, we would pass a
list including `region` and `category` to `groupby`.

```{code-cell} ipython3
region_lang.groupby(["region", "category"])["most_at_home"].agg(["min", "max"])
```

You can also ask for grouped summary statistics on the whole data frame:

```{code-cell} ipython3
:tags: ["output_scroll"]
region_lang.groupby("region").agg(["min", "max"])
```

If you want to ask for only some columns, for example
the columns between `"most_at_home"` and `"lang_known"`,
you might think about first applying `groupby` and then `loc`;
you might think about first applying `groupby` and then `["most_at_home":"lang_known"]`;
but `groupby` returns a `DataFrameGroupBy` object, which does not
work with `loc`. The other option is to do things the other way around:
first use `loc`, then use `groupby`.
This usually does work, but you have to be careful! For example,
in our case, if we try using `loc` and then `groupby`, we get an error.
work with ranges inside `[]`.
The other option is to do things the other way around:
first use `["most_at_home":"lang_known"]`, then use `groupby`.
This can work, but you have to be careful! For example,
in our case, we get an error.

```{code-cell} ipython3
:tags: [remove-output]
region_lang.loc[:, "most_at_home":"lang_known"].groupby("region").max()
region_lang["most_at_home":"lang_known"].groupby("region").max()
```

```
KeyError: 'region'
```
This is because when we use `loc` we selected only the columns between

This is because when we use `[]` we selected only the columns between
`"most_at_home"` and `"lang_known"`, which doesn't include `"region"`!
Instead, we need to call `loc` with a list of column names that
includes `region`, and then use `groupby`.
Instead, we need to use `groupby` first
and then call `[]` with a list of column names that includes `region`;
this approach always works.

```{code-cell} ipython3
:tags: ["output_scroll"]
region_lang.groupby("region")[["most_at_home", "most_at_work", "lang_known"]].max()
```
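
The ordering issue described above can be sketched on a made-up data frame (hypothetical values, not `region_lang`). Selecting a column range that leaves out `region` and then grouping by `region` fails with a `KeyError`, while grouping first and then selecting a list of columns works.

```python
import pandas as pd

# Hypothetical toy data with the column names used in the text.
df = pd.DataFrame({
    "region": ["Toronto", "Toronto", "Halifax"],
    "most_at_home": [80, 20, 30],
    "most_at_work": [60, 10, 25],
    "lang_known": [120, 40, 50],
})

# Select first, group second: "region" is no longer in the selection.
try:
    df.loc[:, "most_at_home":"lang_known"].groupby("region").max()
    raised_key_error = False
except KeyError:
    raised_key_error = True

# Group first, select second: the grouping column is still available.
grouped_max = df.groupby("region")[["most_at_home", "lang_known"]].max()
```

In `grouped_max`, `region` becomes the index name, so `reset_index` would be needed to turn it back into a regular column.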

To see how many observations there are in each group,
we can use `value_counts`.

```{code-cell} ipython3
:tags: ["output_scroll"]
region_lang.value_counts("region")
```

The `value_counts` function also takes the `normalize` parameter,
which shows the output as a proportion instead of a count.

```{code-cell} ipython3
:tags: ["output_scroll"]
region_lang.loc[
:,
["region", "mother_tongue", "most_at_home", "most_at_work", "lang_known"]
].groupby("region").max()
region_lang.value_counts("region", normalize=True)
```

+++