Ch3 suggestions #97
Changes from 2 commits
3f42884
2d5aac7
dc73391
a7e6a4e
438c9f2
464a21f
87b7584
28608ab
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -1013,26 +1013,16 @@ The query (criteria we are using to select values) is input as a string. The `qu | |
is less often used than the earlier approaches we introduced, but it can come in handy | ||
to make long chains of filtering operations a bit easier to read. | ||
|
||
(loc-iloc)= | ||
## Using `loc[]` to filter rows and select columns. | ||
```{index} pandas.DataFrame; loc[] | ||
``` | ||
## Using `[]` to select ranges of columns | ||
|
||
The `[]` operation is only used when you want to filter rows or select columns; | ||
it cannot be used to do both operations at the same time. This is where `loc[]` | ||
comes in. For the first example, recall `loc[]` from Chapter {ref}`intro`, | ||
which lets us create a subset of columns from a data frame. | ||
Suppose we wanted to select only the columns `language`, `region`, | ||
`most_at_home` and `most_at_work` from the `tidy_lang` data set. Using what we | ||
learned in the chapter on {ref}`intro`, we would pass all of these column names into the square brackets. | ||
|
||
```{code-cell} ipython3 | ||
:tags: ["output_scroll"] | ||
selected_columns = tidy_lang.loc[:, ["language", "region", "most_at_home", "most_at_work"]] | ||
selected_columns | ||
tidy_lang[["language", "region", "most_at_home", "most_at_work"]] | ||
``` | ||
We pass `:` before the comma indicating we want to retrieve all rows, and the list indicates | ||
the columns that we want. | ||
|
||
Note that we could obtain the same result by stating that we would like all of the columns | ||
from `language` through `most_at_work`. Instead of passing a list of all of the column | ||
|
@@ -1041,20 +1031,18 @@ you can read as "The columns from `language` to `most_at_work`". | |
|
||
```{code-cell} ipython3 | ||
:tags: ["output_scroll"] | ||
selected_columns = tidy_lang.loc[:, "language":"most_at_work"] | ||
selected_columns | ||
tidy_lang["language":"most_at_work"] | ||
``` | ||
|
||
Similarly, you can ask for all of the columns including and after `language` by doing the following | ||
|
||
```{code-cell} ipython3 | ||
:tags: ["output_scroll"] | ||
selected_columns = tidy_lang.loc[:, "language":] | ||
selected_columns | ||
tidy_lang["language":] | ||
``` | ||
|
||
By not putting anything after the `:`, python reads this as "from `language` until the last column". | ||
Although the notation for selecting a range using `:` is convienent because less code is required, | ||
By not putting anything after the `:`, Python reads this as "from `language` until the last column". | ||
Although the notation for selecting a range using `:` is convenient because less code is required, | ||
it must be used carefully. If you were to re-order columns or add a column to the data frame, the | ||
output would change. Using a list is more explicit and less prone to potential confusion. | ||
|
||
|
@@ -1065,7 +1053,7 @@ us to select variables based on their names. In particular, we can use the `.str | |
to choose only the columns that start with the word "most": | ||
|
||
```{code-cell} ipython3 | ||
tidy_lang.loc[:, tidy_lang.columns.str.startswith('most')] | ||
tidy_lang[tidy_lang.columns.str.startswith('most')] | ||
``` | ||
|
||
```{index} pandas.Series; str.contains | ||
|
@@ -1076,46 +1064,73 @@ We could also have chosen the columns containing an underscore `_` by using the | |
the columns we want contain underscores and the others don't. | ||
|
||
```{code-cell} ipython3 | ||
tidy_lang.loc[:, tidy_lang.columns.str.contains('_')] | ||
tidy_lang[tidy_lang.columns.str.contains('_')] | ||
``` | ||
|
||
(loc-iloc)= | ||
## Using `loc[]` to filter rows and select columns | ||
|
||
```{index} pandas.DataFrame; loc[] | ||
``` | ||
|
||
There are many different functions that help with selecting | ||
variables based on certain criteria. | ||
The additional resources section at the end of this chapter | ||
provides a comprehensive resource on these functions. | ||
The `[]` operation is only used when you want to either filter rows **or** select columns; | ||
it cannot be used to do both operations at the same time. This is where `loc[]` | ||
comes in. For the first example, recall `loc[]` from Chapter {ref}`intro`, | ||
which lets us create a subset of columns from a data frame. | ||
|
||
```{code-cell} ipython3 | ||
:tags: [remove-cell] | ||
:tags: ["output_scroll"] | ||
tidy_lang.loc[ | ||
tidy_lang['region'] == 'Toronto', | ||
["language", "region", "most_at_home", "most_at_work"] | ||
] | ||
``` | ||
|
||
Just as `[]`, `loc` also works with ranges of columns: | ||
|
||
# There are many different `select` helpers that select | ||
# variables based on certain criteria. | ||
# The additional resources section at the end of this chapter | ||
# provides a comprehensive resource on `select` helpers. | ||
```{code-cell} ipython3 | ||
:tags: ["output_scroll"] | ||
tidy_lang.loc[ | ||
tidy_lang['region'] == 'Toronto', | ||
"language":"most_at_work" | ||
] | ||
``` | ||
|
||
## Using `iloc[]` to extract a range of columns | ||
## Using `iloc[]` to extract rows and columns by position | ||
```{index} pandas.DataFrame; iloc[], column range | ||
``` | ||
Another approach for selecting columns is to use `iloc[]`, | ||
which provides the ability to index with integers rather than the names of the columns. | ||
For example, the column names of the `tidy_lang` data frame are | ||
Another approach for selecting rows and columns is to use `iloc[]`, | ||
which provides the ability to index with the position rather than the label of the columns. | ||
For example, the column labels of the `tidy_lang` data frame are | ||
`['category', 'language', 'region', 'most_at_home', 'most_at_work']`. | ||
Using `iloc[]`, you can ask for the `language` column by requesting the | ||
column at index `1` (remember that Python starts counting at `0`, so the second item `'language'` | ||
has index `1`!). | ||
|
||
```{code-cell} ipython3 | ||
column = tidy_lang.iloc[:, 1] | ||
column | ||
tidy_lang.iloc[:, 1] | ||
``` | ||
|
||
You can also ask for multiple columns, just like we did with `[]`. We pass `:` before | ||
the comma, indicating we want to retrieve all rows, and `1:` after the comma | ||
We pass `:` before the comma indicating we want to retrieve all rows. | ||
To ask for multiple columns, | ||
we pass `1:` after the comma | ||
indicating we want columns after and including index 1 (*i.e.* `language`). | ||
|
||
```{code-cell} ipython3 | ||
column_range = tidy_lang.iloc[:, 1:] | ||
column_range | ||
tidy_lang.iloc[:, 1:] | ||
``` | ||
|
||
We can also use `iloc[]` to select ranges of rows, using a similar syntax. | ||
I added this to show that rows can be subset too
seems like something we would want to do in the and for
|
||
For example, to select the first ten rows, we could use the following: | ||
|
||
```{code-cell} ipython3 | ||
tidy_lang.iloc[:10, :] | ||
``` | ||
|
||
`pandas` also provides a shorthand for selecting ranges of rows by using `[]`: | ||
|
||
```{code-cell} ipython3 | ||
tidy_lang[:10] | ||
``` | ||
|
||
The `iloc[]` method is less commonly used, and needs to be used with care. | ||
|
@@ -1251,52 +1266,44 @@ summary statistics that you can compute with `pandas`. | |
What if you want to calculate summary statistics on an entire data frame? Well, | ||
it turns out that the functions in {numref}`tab:basic-summary-statistics` | ||
can be applied to a whole data frame! | ||
For example, we can ask for the number of rows that each column has using `count`. | ||
```{code-cell} ipython3 | ||
region_lang.count() | ||
Comment on lines -1254 to -1256: I don't think
||
``` | ||
Not surprisingly, they are all the same. We could also ask for the `mean`, but | ||
some of the columns in `region_lang` contain string data with words like `"Vancouver"` | ||
and `"Halifax"`---for these columns there is no way for `pandas` to compute the mean. | ||
So we provide the keyword `numeric_only=True` so that it only computes the mean of columns with numeric values. This | ||
is also needed if you want the `sum` or `std`. | ||
```{code-cell} ipython3 | ||
region_lang.mean(numeric_only=True) | ||
``` | ||
If we ask for the `min` or the `max`, `pandas` will give you the smallest or largest number | ||
for columns with numeric values. For columns with text, it will return the | ||
least repeated value for `min` and the most repeated value for `max`. Again, | ||
Comment on lines -1267 to -1268: I believe that this is incorrect; it is alphabetical.
||
if you only want the minimum and maximum value for | ||
numeric columns, you can provide `numeric_only=True`. | ||
For example, we can ask for the maximum value of each column using `max`. | ||
|
||
```{code-cell} ipython3 | ||
region_lang.max() | ||
``` | ||
|
||
We can see that for columns that contain string data | ||
with words like `"Vancouver"` and `"Halifax"`, | ||
the maximum value is determined by sorting the string alphabetically | ||
and returning the last value. | ||
If we only want the maximum value for | ||
numeric columns, | ||
we can provide `numeric_only=True`: | ||
|
||
```{code-cell} ipython3 | ||
region_lang.min() | ||
region_lang.max(numeric_only=True) | ||
``` | ||
|
||
Similarly, if there are only some columns for which you would like to get summary statistics, | ||
you can first use `loc[]` and then ask for the summary statistic. An example of this is illustrated in {numref}`fig:summarize-across`. | ||
Later, we will talk about how you can also use a more general function, `apply`, to accomplish this. | ||
We could also ask for the `mean` of each column in the data frame. | ||
It does not make sense to compute the mean of the string columns, | ||
so in this case we *must* provide the keyword `numeric_only=True` | ||
so that the mean is only computed on columns with numeric values. | ||
|
||
```{figure} img/summarize/summarize.003.jpeg | ||
:name: fig:summarize-across | ||
:figclass: figure | ||
|
||
`loc[]` or `apply` is useful for efficiently calculating summary statistics on | ||
many columns at once. The darker, top row of each table represents the column | ||
headers. | ||
Comment on lines -1282 to -1288: I don't understand what this is meant to say.
||
```{code-cell} ipython3 | ||
region_lang.mean(numeric_only=True) | ||
``` | ||
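For reference on the string-column behaviour discussed in this hunk, here is a minimal sketch. It uses a hypothetical toy data frame named `toy` (an assumption for illustration, not the chapter's `region_lang` data) to show how `max` and `mean` treat a mix of string and numeric columns.

```python
import pandas as pd

# Hypothetical toy data; the column names only loosely mirror the chapter's.
toy = pd.DataFrame(
    {
        "region": ["Vancouver", "Halifax", "Toronto"],
        "mother_tongue": [10, 20, 60],
        "most_at_home": [5, 15, 40],
    }
)

# Without numeric_only, max() also reduces the string column,
# comparing strings in lexicographic (alphabetical) order.
max_all = toy.max()

# With numeric_only=True, string columns are left out of the result.
max_numeric = toy.max(numeric_only=True)
mean_numeric = toy.mean(numeric_only=True)

print(max_all)
print(max_numeric)
print(mean_numeric)
```

In this toy frame the `region` entry of `max_all` is the alphabetically last string, which matches the "alphabetical" reading given in the review comment rather than a "most repeated value" reading.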
|
||
If there are only some columns for which you would like to get summary statistics, | ||
you can first use `[]` to select those columns | ||
and then ask for the summary statistic, | ||
as we did for a single column previously: | ||
Let's say that we want to know | ||
the mean and standard deviation of all of the columns between `"mother_tongue"` and `"lang_known"`. | ||
We use `loc[]` to specify the columns and then `agg` to ask for both the `mean` and `std`. | ||
We use `[]` to specify the columns and then `agg` to ask for both the `mean` and `std`. | ||
```{code-cell} ipython3 | ||
region_lang.loc[:, "mother_tongue":"lang_known"].agg(["mean", "std"]) | ||
region_lang["mother_tongue":"lang_known"].agg(["mean", "std"]) | ||
wait, I thought
Good catch, this is a typo and should be
||
``` | ||
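As background for the `[]` versus `loc[]` range question in this hunk, the sketch below may help. It assumes a hypothetical toy data frame `toy` with a default integer row index (not `region_lang`), and compares a label slice passed to `loc[]` with a slice passed to plain `[]`.

```python
import pandas as pd

# Hypothetical toy data frame with a default integer row index.
toy = pd.DataFrame(
    {
        "category": ["a", "b", "c"],
        "language": ["English", "French", "Cree"],
        "region": ["Toronto", "Toronto", "Halifax"],
        "most_at_home": [1, 2, 3],
        "most_at_work": [4, 5, 6],
    }
)

# A label slice after the comma in loc[] selects a range of columns,
# and both endpoints are included.
cols_by_loc = toy.loc[:, "language":"most_at_work"]

# A slice inside plain [] is applied to rows, not columns.
rows_by_brackets = toy[0:2]

# Passing column labels as a slice to plain [] is therefore not a column
# selection; on this toy frame it is expected to raise an error.
try:
    toy["language":"most_at_work"]
    label_slice_failed = False
except Exception as err:  # broad on purpose: this is only a demonstration
    label_slice_failed = True
    print(type(err).__name__, err)

print(list(cols_by_loc.columns))
print(rows_by_brackets.shape)
```

This is only an illustration of general `pandas` indexing behaviour; it does not restate what the elided code in the comments above said.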
|
||
|
||
|
||
## Performing operations on groups of rows using `groupby` | ||
|
||
+++ | ||
|
the connecting thread in this section doesn't seem to be the use of `loc[]`, it seems to be about selecting ranges

at a high level, many of the edits in this section make it flow poorly -- lots of jumping between `[]` and `.loc[]` when the section is supposed to be about "using `.loc[]` to do X, when/why to use `.loc[]` instead of `[]`"

I think we need to more carefully consider how we want to organize this stuff
e.g. I'm OK with demonstrating various uses of `.loc[]` that flow better with the text pedagogically even if technically one could use `[]` for the same purpose.
The only jump to `[]` that I see is when we use `startswith`, and that was a mistake on my part because we should be using `loc` there. I updated this section, do you think it flows better now?
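To make the `startswith` point concrete, here is a minimal sketch on a hypothetical toy data frame `toy` (an assumption for illustration, not `tidy_lang`). The mask has one entry per column, so it is passed after the comma in `loc[]`. In plain `[]`, a boolean array is interpreted as a row filter instead.

```python
import pandas as pd

# Hypothetical toy data frame: 3 rows, 4 columns.
toy = pd.DataFrame(
    {
        "language": ["English", "French", "Cree"],
        "region": ["Toronto", "Toronto", "Halifax"],
        "most_at_home": [1, 2, 3],
        "most_at_work": [4, 5, 6],
    }
)

# Boolean array with one True/False entry per column.
starts_with_most = toy.columns.str.startswith("most")

# Column selection by mask: all rows, only the columns flagged True.
most_cols = toy.loc[:, starts_with_most]

# loc[] can combine a row filter and a column mask in one step.
toronto_most = toy.loc[toy["region"] == "Toronto", starts_with_most]

print(list(most_cols.columns))
print(toronto_most)
```

The last line is the case that motivates `loc[]` in the section text: filtering rows and selecting columns in a single operation.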