Polishing flow on the Ch3 PR #186
one can use in the `[]` to select subsets of rows.

+++

### Extracting columns by name
Reasoning for this edit: This is essentially a repeat of material from Ch 1. But we do the same in the R version, and it doesn't hurt to have it here, especially if someone is going to search for how to subset rows/cols and end up in Ch 3 (maybe missing Ch 1 entirely). The section is named "extract rows or columns", so it would be a bit odd not to discuss columns at all.
Suppose we wanted to select only the columns `language`, `region`,
`most_at_home` and `most_at_work` from the `tidy_lang` data set. Using what we
learned in the chapter on {ref}`intro`, we would pass all of these column names into the square brackets.
In addition to simultaneous subsetting of rows and columns, `loc[]` has two
@joelostblom I think what was missing before was an emphasis on why we would care about `loc[]`. I've written it here to reference two special abilities of `loc[]` beyond `[]` -- ranges and logical statements to select columns.
I like this additional motivation!
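For readers following the thread, here is a minimal sketch of the two `loc[]` abilities mentioned above: selecting a range of columns by label, and selecting columns with a logical statement. The `toy_lang` data frame is a hypothetical stand-in for `tidy_lang` with made-up values, and the `startswith` condition is only one possible logical statement.

```python
import pandas as pd

# Hypothetical stand-in for the chapter's `tidy_lang` data; all values are made up.
toy_lang = pd.DataFrame({
    "language": ["English", "French", "Cree"],
    "region": ["Toronto", "Montréal", "Edmonton"],
    "most_at_home": [100, 200, 30],
    "most_at_work": [90, 150, 10],
})

# 1. A range of columns by label. With loc[], both endpoints are included.
by_range = toy_lang.loc[:, "language":"most_at_home"]

# 2. A logical (boolean) statement over the column names:
#    keep only the columns whose name starts with "most".
by_condition = toy_lang.loc[:, toy_lang.columns.str.startswith("most")]

print(list(by_range.columns))
print(list(by_condition.columns))
```

Plain `[]` with a list of names covers the basic case; the range and the boolean mask are the parts that need `loc[]`.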

```{code-cell} ipython3
:tags: ["output_scroll"]
tidy_lang[["language", "region", "most_at_home", "most_at_work"]]
```
It was weird to have the first example in the "using loc" section be `[]`. Now the first example is a repeat of what we did in Ch 1 (basic usage), followed by "here's what else you can do with loc".

```{code-cell} ipython3
region_lang.groupby("region")["most_at_home"].agg(["min", "max"])
```

The resulting dataframe has `region` as an index name.
This is similar to what happened when we reshaped data frames in the previous chapter,
This is similar to what happened when we used the `pivot` function
in the section on {ref}`pivot-wider`;
and just as we did then,
it's not in the previous chapter -- it's in the pivot section
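A small sketch of the behaviour this hunk describes: after `groupby(...).agg(...)`, the grouping column becomes the index. The hunk is cut off before saying what was done in the pivot section; the sketch assumes it was `reset_index`, which is the usual pandas way to turn the index back into a column. The `toy_region_lang` data are hypothetical.

```python
import pandas as pd

# Hypothetical stand-in for `region_lang`; values are made up.
toy_region_lang = pd.DataFrame({
    "region": ["Toronto", "Toronto", "Vancouver", "Vancouver"],
    "language": ["English", "French", "English", "French"],
    "most_at_home": [50, 20, 35, 15],
})

# Grouped summary: `region` ends up as the index, not as a column.
summary = toy_region_lang.groupby("region")["most_at_home"].agg(["min", "max"])
print(summary.index.name)

# Assumed follow-up step: move `region` back into a regular column.
flat = summary.reset_index()
print(list(flat.columns))
```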
the mean and standard deviation of all of the columns between `"mother_tongue"` and `"lang_known"`.
We use `[]` to specify the columns and then `agg` to ask for both the `mean` and `std`.
you can first use `[]` or `.loc[]` to select those columns,
and then ask for the summary statistic
your example right after used `.loc[]` so I added that to the text
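As an illustration of the text being edited here, a minimal sketch of selecting a range of columns with `.loc[]` and then asking for the mean and standard deviation with `agg`. The column names follow the hunk; the `toy_region_lang` values are hypothetical.

```python
import pandas as pd

# Hypothetical data with the column names used in the hunk; values are made up.
toy_region_lang = pd.DataFrame({
    "language": ["English", "French", "Cree"],
    "mother_tongue": [10, 20, 30],
    "most_at_home": [5, 10, 15],
    "most_at_work": [1, 2, 3],
    "lang_known": [20, 40, 60],
})

# Select every column from "mother_tongue" through "lang_known" (inclusive),
# then compute both summary statistics for each selected column.
stats = toy_region_lang.loc[:, "mother_tongue":"lang_known"].agg(["mean", "std"])
print(stats)
```

The result has one row per statistic and one column per selected variable.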

```{code-cell} ipython3
tidy_lang[:10]
```
I got rid of this shorthand example entirely. I think students will confuse it with the `[]` operator from earlier. They certainly might (will eventually) see it in the wild, but I think the priority here is to make sure everyone understands what we teach and to keep things contained, and not necessarily to cover every possible thing they'll see beyond the class.

```{code-cell} ipython3
:tags: ["output_scroll"]
tidy_lang.loc[:, :"language"]
```
I figured including a "before" example would be helpful here too -- this is a bit weird syntax, so being a bit more explicit is useful
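To make the open-ended slice syntax concrete, here is a small sketch contrasting "everything up to a column" with "everything from a column onward". The `toy_lang` data frame and its column order are hypothetical.

```python
import pandas as pd

# Hypothetical data frame; the column order matters for label slices.
toy_lang = pd.DataFrame({
    "category": ["Official", "Aboriginal"],
    "language": ["English", "Cree"],
    "mother_tongue": [100, 10],
    "most_at_home": [90, 5],
})

# Omitting the start label means "from the first column".
up_to_language = toy_lang.loc[:, :"language"]

# Omitting the end label means "through the last column".
from_language = toy_lang.loc[:, "language":]

print(list(up_to_language.columns))
print(list(from_language.columns))
```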
@joelostblom this PR edits your other PR -- if you are happy with these edits, we can merge this one and then merge your PR into |
This looks good to me, thanks for making these changes!
* Align explanation of loc and iloc with the intro chapter
* Explain aggregations more intuitively
* Remove loc from groupby section and simplify it
* Add mention of value counts for group sizes
  Preferred over size for me since it has `normalize`
* Note that [] cannot be used for ranges and we need loc[] for that
* Update startswith with the correct explanation
* Fix typo
* tc polish on wrangling (#186)

---------

Co-authored-by: Trevor Campbell <[email protected]>
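A brief sketch of the `value_counts` point in the commit message, assuming a hypothetical `toy_region_lang` data frame: `value_counts` gives group sizes like `groupby(...).size()`, and its `normalize=True` option returns proportions instead of counts.

```python
import pandas as pd

# Hypothetical data; only the `region` column matters here.
toy_region_lang = pd.DataFrame({
    "region": ["Toronto", "Toronto", "Toronto", "Vancouver"],
})

# Group sizes as counts.
counts = toy_region_lang["region"].value_counts()

# Group sizes as proportions of all rows.
shares = toy_region_lang["region"].value_counts(normalize=True)

# The same counts via groupby + size, for comparison.
sizes = toy_region_lang.groupby("region").size()

print(counts)
print(shares)
print(sizes)
```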
This is an edit on #97 -- since I made quite a few changes, I decided to open a separate PR that will be merged into that PR itself prior to going into `main`.