Ch3 suggestions #97

joelostblom · 2023-01-05T15:34:58Z

This is my continuation of the review of ch3 I started yesterday. I think we could simplify this chapter quite a lot via #100 (but that's for later maybe).

close #95

Preferred over `size` for me since it has `normalize`
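To illustrate the `normalize` point, here is a minimal sketch comparing `value_counts` with `groupby(...).size()`. The small `region_lang` frame below is a hypothetical stand-in, not the book's dataset.

```python
import pandas as pd

# Hypothetical toy stand-in for the chapter's region_lang data frame.
region_lang = pd.DataFrame({
    "region": ["Toronto", "Toronto", "Montreal", "Vancouver"],
    "most_at_home": [50, 150, 250, 350],
})

# Group sizes as raw counts.
counts = region_lang["region"].value_counts()

# The same group sizes as proportions of all rows.
proportions = region_lang["region"].value_counts(normalize=True)

# groupby + size gives the counts too, but has no normalize option.
sizes = region_lang.groupby("region").size()
```

Getting proportions from `sizes` would need an extra division by the total, which is presumably why `value_counts` reads more simply here.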

joelostblom · 2023-01-05T15:36:14Z

source/wrangling.md

+tidy_lang.iloc[:, 1:]
+```
+
+We can also use `iloc[]` to select ranges of rows, using a similar syntax.


I added this to show that rows can be subset too

seems like something we would want to do in the .loc[] section above too (assuming we didn't already)

and for [] by itself too somewhere, unless we've already done that elsewhere

[] by itself comes just after iloc below. If I remember correctly, I opted to not include row ranges with loc because then we need to get into explaining named indexes. For example .loc[:5] is only the same as [:5] and .iloc[:5] if the index names of the first five rows are 0,1,2,3,4, since loc always selects by row/index name/label.
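A minimal sketch of the row-label issue, using a hypothetical seven-row frame rather than the chapter's data. It also shows one extra wrinkle: `.loc` slices include the end label, so even with the default labels `.loc[:5]` returns one more row than `[:5]` or `.iloc[:5]`.

```python
import pandas as pd

# Hypothetical toy data with the default row labels 0..6.
df = pd.DataFrame({"value": [10, 20, 30, 40, 50, 60, 70]})

by_position = df.iloc[:5]   # positions 0..4
by_brackets = df[:5]        # also positions 0..4
by_label = df.loc[:5]       # labels 0..5, end label included

# Sorting keeps the original labels attached to their rows,
# so labels and positions no longer line up.
sorted_desc = df.sort_values("value", ascending=False)

label_slice = sorted_desc.loc[:5]      # from the first row up to label 5
position_slice = sorted_desc.iloc[:5]  # the first five rows
```

After sorting, `label_slice` holds only the rows labeled 6 and 5, while `position_slice` still holds five rows, which is the kind of surprise that explaining named indexes would have to cover.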

joelostblom · 2023-01-05T15:36:56Z

source/wrangling.md

-For example, we can ask for the number of rows that each column has using `count`.
-```{code-cell} ipython3
-region_lang.count()


I don't think `count` is a good example, since a data frame must have the same number of rows in each column, and we should get that info from `shape` or `info` instead. I think this flows better from the previous section too.
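A small sketch of the difference, on hypothetical toy data: `shape` reports the number of rows directly, whereas `count` reports the number of non-missing values per column, so the two only agree when nothing is missing.

```python
import pandas as pd

# Hypothetical toy data with one missing value in most_at_home.
df = pd.DataFrame({
    "region": ["Toronto", "Montreal", "Vancouver"],
    "most_at_home": [50.0, None, 30.0],
})

# shape gives (number of rows, number of columns).
n_rows, n_cols = df.shape

# count gives the number of non-missing entries in each column.
non_missing = df.count()
```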

joelostblom · 2023-01-05T15:37:23Z

source/wrangling.md

-for columns with numeric values. For columns with text, it will return the
-least repeated value for `min` and the most repeated value for `max`. Again,


I believe that this is incorrect; it is alphabetical.
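A quick check on a hypothetical series of language names, chosen so that frequency and alphabetical order disagree:

```python
import pandas as pd

# Hypothetical toy data: "Cree" is the most repeated value,
# "English" is the least repeated one.
langs = pd.Series(["Cree", "Cree", "Cree", "English", "French", "French"])

smallest = langs.min()  # first in alphabetical order
largest = langs.max()   # last in alphabetical order

# For comparison, the most and least repeated values.
most_repeated = langs.value_counts().idxmax()
least_repeated = langs.value_counts().idxmin()
```

Here `min` returns the most repeated value and `max` returns a value that is neither the most nor the least repeated, which is consistent with string comparison rather than frequency.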

joelostblom · 2023-01-05T15:39:22Z

source/wrangling.md

-```{figure} img/summarize/summarize.003.jpeg
-:name: fig:summarize-across
-:figclass: figure
-
-`loc[]` or `apply` is useful for efficiently calculating summary statistics on
-many columns at once. The darker, top row of each table represents the column
-headers.


I don't understand what this is meant to say. `loc` is already explained in the text, and `apply` is not efficient; it is less efficient than what we use here. The figure also does not seem to add anything, and I don't understand what it is showing myself.

joelostblom · 2023-01-05T16:04:40Z

source/wrangling.md

@@ -1334,56 +1341,71 @@ The `groupby` function takes at least one argument&mdash;the columns to use in t
 grouping. Here we use only one column for grouping (`region`).

 ```{code-cell} ipython3
-region_lang.groupby("region")["most_at_home"].agg(["min", "max"])


I think this order of explanation is more helpful, introducing one step at a time

joelostblom · 2023-01-05T16:05:00Z

source/wrangling.md

+```{code-cell} ipython3
+region_lang.groupby("region")["most_at_home"].agg(["min", "max"]).reset_index()
+```


I think this is quite important to mention here too as we don't cover named indices or multi-indices
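A minimal sketch of what `reset_index` changes, using a hypothetical stand-in for `region_lang`:

```python
import pandas as pd

# Hypothetical toy stand-in for the chapter's region_lang data frame.
region_lang = pd.DataFrame({
    "region": ["Toronto", "Toronto", "Montreal", "Montreal"],
    "most_at_home": [50, 150, 250, 350],
})

# After groupby + agg, the grouping column becomes the (named) row index.
summary = region_lang.groupby("region")["most_at_home"].agg(["min", "max"])

# reset_index turns it back into an ordinary column with default row labels.
flat = summary.reset_index()
```

With `flat`, readers can keep using `[]` and boolean filtering on `region` without first learning about named indexes.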

joelostblom · 2023-01-05T16:05:43Z

source/wrangling.md

+Instead, we need to use `groupby` first
+and then call `[]` with a list of column names that includes `region`;
+this approach always works.
+
 ```{code-cell} ipython3
 :tags: ["output_scroll"]
-region_lang.loc[
-  :,
-  ["region", "mother_tongue", "most_at_home", "most_at_work", "lang_known"]
-].groupby("region").max()
+region_lang.groupby("region")[["most_at_home", "lang_known"]].max()


I think this is another example of why we should prefer [] over loc whenever possible.
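For comparison, a sketch of both spellings on a hypothetical stand-in for `region_lang`; the two produce the same table, but the `[]` version does not need the grouping column repeated in the selection.

```python
import pandas as pd

# Hypothetical toy stand-in for the chapter's region_lang data frame.
region_lang = pd.DataFrame({
    "region": ["Toronto", "Toronto", "Montreal", "Montreal"],
    "most_at_home": [50, 150, 250, 350],
    "lang_known": [500, 600, 700, 800],
})

# Group first, then pick the columns to summarize with [].
via_brackets = region_lang.groupby("region")[["most_at_home", "lang_known"]].max()

# The loc version has to carry "region" along so that groupby can find it.
via_loc = (
    region_lang.loc[:, ["region", "most_at_home", "lang_known"]]
    .groupby("region")
    .max()
)
```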

trevorcampbell

I put in these comments after just reading the diff. Usually that's sufficient, but in this case I think I need to do another pass on this with more comments after reading both the compiled original and modified chapter

trevorcampbell · 2023-07-11T18:51:40Z

source/wrangling.md

 it must be used carefully. If you were to re-order columns or add a column to the data frame, the
 output would change. Using a list is more explicit and less prone to potential confusion.

 Suppose instead we wanted to extract columns that followed a particular pattern
 rather than just selecting a range. For example, let's say we wanted only to select the
 columns `most_at_home` and `most_at_work`. There are other functions that allow
 us to select variables based on their names. In particular, we can use the `.str.startswith` method
-to choose only the columns that start with the word "most":
+to choose only the columns that start with the word "most".
+Since the `str.starswith` expression returns a list of column names


typo "starswith"

trevorcampbell · 2023-07-11T18:55:42Z

source/wrangling.md

 ```{index} pandas.DataFrame; loc[]
 ```

-The `[]` operation is only used when you want to filter rows or select columns;
+The `[]` operation is only used when you want to either filter rows **or** select columns;


bit unclear given the below section on using : (where you need to use .loc[] even if you're only doing a range on one var)

ugh... also complicated by the fact that you can use [] for str.startswith etc.

When we use `loc` with a range we are technically filtering rows too with the `:`, but I see what you mean: the intention of that operation is just to select a range of columns, and the row filtering is a syntactical detail. We could rewrite this to "The `[]` operation is only used when you want to either filter rows [using a boolean expression] or select [a list of columns]", but I am not sure if that is too specific.

> also complicated by the fact that you can use [] for str.startswith etc

That was a mistake on my part. You can't do that unless you add an extra step of filtering the column names using the boolean array returned from `str.startswith`, so I updated that section to use `loc` instead.
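A minimal sketch of the two routes, on a hypothetical stand-in for `region_lang`. The point is that `str.startswith` returns a boolean array over the column names, not a list of names.

```python
import pandas as pd

# Hypothetical toy stand-in for the chapter's region_lang data frame.
region_lang = pd.DataFrame({
    "region": ["Toronto", "Montreal"],
    "most_at_home": [50, 250],
    "most_at_work": [10, 30],
    "lang_known": [500, 700],
})

# One True/False entry per column name.
starts_with_most = region_lang.columns.str.startswith("most")

# loc accepts the boolean array directly in the column position.
with_loc = region_lang.loc[:, starts_with_most]

# [] needs actual column names, hence the extra filtering step.
most_names = region_lang.columns[starts_with_most]
with_brackets = region_lang[most_names]
```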

trevorcampbell · 2023-07-11T18:56:47Z

source/wrangling.md

+```
+
+### Using `loc[]` to select ranges of columns
+


the connecting thread in this section doesn't seem to be the use of loc[], it seems to be about selecting ranges

at a high level, many of the edits in this section make it flow poorly -- lots of jumping between `[]` and `.loc[]` when the section is supposed to be about "using `.loc[]` to do X, when/why to use `.loc[]` instead of `[]`"

I think we need to more carefully consider how we want to organize this stuff

e.g. I'm OK with demonstrating various uses of .loc[] that flow better with the text pedagogically even if technically one could use [] for the same purpose.

The only jump to `[]` that I see is when we use `startswith`, and that was a mistake on my part because we should be using `loc` there. I updated this section; do you think it flows better now?

trevorcampbell · 2023-07-11T19:03:11Z

source/wrangling.md

+tidy_lang.iloc[:, 1:]
+```
+
+We can also use `iloc[]` to select ranges of rows, using a similar syntax.


seems like something we would want to do in the .loc[] section above too (assuming we didn't already)

and for [] by itself too somewhere, unless we've already done that elsewhere

trevorcampbell · 2023-07-11T19:04:47Z

source/wrangling.md

 ```{code-cell} ipython3
-region_lang.loc[:, "mother_tongue":"lang_known"].agg(["mean", "std"])
+region_lang["mother_tongue":"lang_known"].agg(["mean", "std"])


wait, I thought : wasn't allowed in [] as per text above

Good catch, this is a typo and should be `region_lang.loc[:, "most_at_home":"most_at_work"]` as in the next section.
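A sketch of the corrected call on a hypothetical stand-in for `region_lang`. A slice inside plain `[]` is interpreted as a row slice, so a range of column names has to go in the column position of `loc[]`.

```python
import pandas as pd

# Hypothetical toy stand-in for the chapter's region_lang data frame.
region_lang = pd.DataFrame({
    "region": ["Toronto", "Toronto", "Montreal", "Montreal"],
    "mother_tongue": [100, 200, 300, 400],
    "most_at_home": [50, 150, 250, 350],
    "most_at_work": [10, 20, 30, 40],
    "lang_known": [500, 600, 700, 800],
})

# Select the range of columns with loc, then aggregate each one.
stats = region_lang.loc[:, "most_at_home":"most_at_work"].agg(["mean", "std"])
```

The result has one row per statistic and one column per selected variable.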

joelostblom · 2023-07-27T00:34:47Z

@trevorcampbell a ping here just because I have started on the first worksheet and tutorial and some of the changes there depend on which decision we take here

trevorcampbell · 2023-07-27T17:53:01Z

OK, I will look at this today at some point!

joelostblom added 5 commits January 5, 2023 16:34

Align explanation of loc and iloc with the intro chapter

3f42884

Explain aggregations more intuitively

2d5aac7

Remove loc from groupby section and simplify it

dc73391

Add mention of value counts for group sizes

a7e6a4e

Preferred over `size` for me since it has `normalize`

Note that [] cannot be used for ranges and we need loc[] for that

438c9f2

joelostblom commented Jan 5, 2023

joelostblom requested review from lheagy and trevorcampbell and removed request for lheagy January 5, 2023 18:12

joelostblom marked this pull request as ready for review January 5, 2023 18:12

trevorcampbell reviewed Jul 11, 2023

joelostblom added 2 commits July 13, 2023 11:18

Update startswith with the correct explanation

464a21f

Fix typo

87b7584

joelostblom requested a review from trevorcampbell July 13, 2023 18:26

trevorcampbell mentioned this pull request Jul 27, 2023

Polishing flow on the Ch3 PR #186

Merged

tc polish on wrangling (#186)

28608ab

joelostblom merged commit 5ce18b6 into main Jul 29, 2023
