-
Notifications
You must be signed in to change notification settings - Fork 15
Ch3 suggestions #97
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ch3 suggestions #97
Conversation
Prefered over size for me since it has `normalize`
source/wrangling.md
Outdated
tidy_lang.iloc[:, 1:] | ||
``` | ||
|
||
We can also use `iloc[]` to select ranges of rows, using a similar syntax. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this to show that rows can be subset too
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems like something we would want to do in the .loc[]
section above too (assuming we didn't already)
and for []
by itself too somewhere, unless we've already done that elsewhere
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[]
by itself comes just after iloc
below. If I remember correctly, I opted to not include row ranges with loc
because then we need to get into explaining named indexes. For example .loc[:5]
is only the same as [:5]
and .iloc[:5]
if the index names of the first five rows are 0,1,2,3,4, since loc
always selects by row/index name/label.
For example, we can ask for the number of rows that each column has using `count`. | ||
```{code-cell} ipython3 | ||
region_lang.count() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think count
is a good example since data frame must have the same number of rows in each column, and we should get that infor from shape or info instead. I think this flows better from the previous section too
for columns with numeric values. For columns with text, it will return the | ||
least repeated value for `min` and the most repeated value for `max`. Again, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that this is incorrect; it is alphabetical.
```{figure} img/summarize/summarize.003.jpeg | ||
:name: fig:summarize-across | ||
:figclass: figure | ||
|
||
`loc[]` or `apply` is useful for efficiently calculating summary statistics on | ||
many columns at once. The darker, top row of each table represents the column | ||
headers. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand what this is means to say. loc
is already explained in the text and apply
is the not efficient, it is more inefficient than what we use here. The figure also does not seem to add anything, I don't understand what it is showing myself.
@@ -1334,56 +1341,71 @@ The `groupby` function takes at least one argument—the columns to use in t | |||
grouping. Here we use only one column for grouping (`region`). | |||
|
|||
```{code-cell} ipython3 | |||
region_lang.groupby("region")["most_at_home"].agg(["min", "max"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this order of explanation is more helpful, introducing one step at a time
```{code-cell} ipython3 | ||
region_lang.groupby("region")["most_at_home"].agg(["min", "max"]).reset_index() | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is quite important to mention here too as we don't cover named indices or multi-indices
source/wrangling.md
Outdated
Instead, we need to use `groupby` first | ||
and then call `[]` with a list of column names that includes `region`; | ||
this approach always works. | ||
|
||
```{code-cell} ipython3 | ||
:tags: ["output_scroll"] | ||
region_lang.loc[ | ||
:, | ||
["region", "mother_tongue", "most_at_home", "most_at_work", "lang_known"] | ||
].groupby("region").max() | ||
region_lang.groupby("region")[["most_at_home", "lang_known"]].max() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is another example of why we should prefer []
over loc
whenever possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I put in these comments after just reading the diff. Usually that's sufficient, but in this case I think I need to do another pass on this with more comments after reading both the compiled original and modified chapter
source/wrangling.md
Outdated
it must be used carefully. If you were to re-order columns or add a column to the data frame, the | ||
output would change. Using a list is more explicit and less prone to potential confusion. | ||
|
||
Suppose instead we wanted to extract columns that followed a particular pattern | ||
rather than just selecting a range. For example, let's say we wanted only to select the | ||
columns `most_at_home` and `most_at_work`. There are other functions that allow | ||
us to select variables based on their names. In particular, we can use the `.str.startswith` method | ||
to choose only the columns that start with the word "most": | ||
to choose only the columns that start with the word "most". | ||
Since the `str.starswith` expression returns a list of column names |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo "starswith"
```{index} pandas.DataFrame; loc[] | ||
``` | ||
|
||
The `[]` operation is only used when you want to filter rows or select columns; | ||
The `[]` operation is only used when you want to either filter rows **or** select columns; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bit unclear given the below section on using :
(where you need to use .loc[]
even if you're only doing a range on one var)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ugh... also complicated by the fact that you can use []
for str.startswith
etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When we use loc
with a range we are technically filtering rows too with the :
, but I see what you mean regarding that the intention of that operation is just to select a range of columns and the row filtering is a syntacitcal detail. We could rewrite this too "The []
operation is only used when you want to either filter rows [using a boolean expression] or select [a list of columns], but I am not sure if that is too specific.
also complicated by the fact that you can use [] for str.startswith etc
That was a mistake on my part, you can't do that unless you add an extra step of filtering the column names using the boolean array returned from str.startswith
so I updated that section to use loc
instead.
source/wrangling.md
Outdated
``` | ||
|
||
### Using `loc[]` to select ranges of columns | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the connecting thread in this section doesn't seem to be the use of loc[]
, it seems to be about selecting ranges
at a high level, many of the edits in this section make it flow poorly -- lots of jumping between []
and .loc[]
when the section is supposed to be about "using `.loc[] to do X, when/why to use .loc[] instead of []"
I think we need to more carefully consider how we want to organize this stuff
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
e.g. I'm OK with demonstrating various uses of .loc[]
that flow better with the text pedagogically even if technically one could use []
for the same purpose.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only jump to []
that I see is when we use starswith
and that was a mistake on my part because we should be using loc
there. I updated this section, do you think it flows better now?
source/wrangling.md
Outdated
tidy_lang.iloc[:, 1:] | ||
``` | ||
|
||
We can also use `iloc[]` to select ranges of rows, using a similar syntax. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems like something we would want to do in the .loc[]
section above too (assuming we didn't already)
and for []
by itself too somewhere, unless we've already done that elsewhere
source/wrangling.md
Outdated
```{code-cell} ipython3 | ||
region_lang.loc[:, "mother_tongue":"lang_known"].agg(["mean", "std"]) | ||
region_lang["mother_tongue":"lang_known"].agg(["mean", "std"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wait, I thought :
wasn't allowed in []
as per text above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch, this is a typo and should be region_lang.loc[:, "most_at_home":"most_at_work"]
as in the next section.
@trevorcampbell a ping here just because I have started on the first worksheet and tutorial and some of the changes there depend on which decision we take here |
OK, I will look at this today at some point! |
This is my continuation of the review of ch3 I started yesterday. I think we could simplify this chapter quite a lot via #100 (but that's for later maybe).
close #95