Skip to content

Ch3 suggestions #97

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
Jul 29, 2023
Merged

Ch3 suggestions #97

merged 8 commits into from
Jul 29, 2023

Conversation

joelostblom
Copy link
Contributor

@joelostblom joelostblom commented Jan 5, 2023

This is my continuation of the review of ch3 I started yesterday. I think we could simplify this chapter quite a lot via #100 (but that's for later maybe).

close #95

tidy_lang.iloc[:, 1:]
```

We can also use `iloc[]` to select ranges of rows, using a similar syntax.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this to show that rows can be subset too

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like something we would want to do in the .loc[] section above too (assuming we didn't already)

and for [] by itself too somewhere, unless we've already done that elsewhere

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[] by itself comes just after iloc below. If I remember correctly, I opted to not include row ranges with loc because then we need to get into explaining named indexes. For example .loc[:5] is only the same as [:5] and .iloc[:5] if the index names of the first five rows are 0,1,2,3,4, since loc always selects by row/index name/label.

Comment on lines -1254 to -1256
For example, we can ask for the number of rows that each column has using `count`.
```{code-cell} ipython3
region_lang.count()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think count is a good example since data frame must have the same number of rows in each column, and we should get that infor from shape or info instead. I think this flows better from the previous section too

Comment on lines -1267 to -1268
for columns with numeric values. For columns with text, it will return the
least repeated value for `min` and the most repeated value for `max`. Again,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that this is incorrect; it is alphabetical.

Comment on lines -1282 to -1288
```{figure} img/summarize/summarize.003.jpeg
:name: fig:summarize-across
:figclass: figure

`loc[]` or `apply` is useful for efficiently calculating summary statistics on
many columns at once. The darker, top row of each table represents the column
headers.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand what this is means to say. loc is already explained in the text and apply is the not efficient, it is more inefficient than what we use here. The figure also does not seem to add anything, I don't understand what it is showing myself.

@@ -1334,56 +1341,71 @@ The `groupby` function takes at least one argument—the columns to use in t
grouping. Here we use only one column for grouping (`region`).

```{code-cell} ipython3
region_lang.groupby("region")["most_at_home"].agg(["min", "max"])
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this order of explanation is more helpful, introducing one step at a time

Comment on lines +1362 to +1364
```{code-cell} ipython3
region_lang.groupby("region")["most_at_home"].agg(["min", "max"]).reset_index()
```
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is quite important to mention here too as we don't cover named indices or multi-indices

Comment on lines 1402 to 1408
Instead, we need to use `groupby` first
and then call `[]` with a list of column names that includes `region`;
this approach always works.

```{code-cell} ipython3
:tags: ["output_scroll"]
region_lang.loc[
:,
["region", "mother_tongue", "most_at_home", "most_at_work", "lang_known"]
].groupby("region").max()
region_lang.groupby("region")[["most_at_home", "lang_known"]].max()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is another example of why we should prefer [] over loc whenever possible.

@joelostblom joelostblom requested review from lheagy and trevorcampbell and removed request for lheagy January 5, 2023 18:12
@joelostblom joelostblom marked this pull request as ready for review January 5, 2023 18:12
Copy link
Contributor

@trevorcampbell trevorcampbell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put in these comments after just reading the diff. Usually that's sufficient, but in this case I think I need to do another pass on this with more comments after reading both the compiled original and modified chapter

it must be used carefully. If you were to re-order columns or add a column to the data frame, the
output would change. Using a list is more explicit and less prone to potential confusion.

Suppose instead we wanted to extract columns that followed a particular pattern
rather than just selecting a range. For example, let's say we wanted only to select the
columns `most_at_home` and `most_at_work`. There are other functions that allow
us to select variables based on their names. In particular, we can use the `.str.startswith` method
to choose only the columns that start with the word "most":
to choose only the columns that start with the word "most".
Since the `str.starswith` expression returns a list of column names
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo "starswith"

```{index} pandas.DataFrame; loc[]
```

The `[]` operation is only used when you want to filter rows or select columns;
The `[]` operation is only used when you want to either filter rows **or** select columns;
Copy link
Contributor

@trevorcampbell trevorcampbell Jul 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bit unclear given the below section on using : (where you need to use .loc[] even if you're only doing a range on one var)

Copy link
Contributor

@trevorcampbell trevorcampbell Jul 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ugh... also complicated by the fact that you can use [] for str.startswith etc.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When we use loc with a range we are technically filtering rows too with the :, but I see what you mean regarding that the intention of that operation is just to select a range of columns and the row filtering is a syntacitcal detail. We could rewrite this too "The [] operation is only used when you want to either filter rows [using a boolean expression] or select [a list of columns], but I am not sure if that is too specific.

also complicated by the fact that you can use [] for str.startswith etc

That was a mistake on my part, you can't do that unless you add an extra step of filtering the column names using the boolean array returned from str.startswith so I updated that section to use loc instead.

```

### Using `loc[]` to select ranges of columns

Copy link
Contributor

@trevorcampbell trevorcampbell Jul 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the connecting thread in this section doesn't seem to be the use of loc[], it seems to be about selecting ranges

at a high level, many of the edits in this section make it flow poorly -- lots of jumping between [] and .loc[] when the section is supposed to be about "using `.loc[] to do X, when/why to use .loc[] instead of []"

I think we need to more carefully consider how we want to organize this stuff

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

e.g. I'm OK with demonstrating various uses of .loc[] that flow better with the text pedagogically even if technically one could use [] for the same purpose.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only jump to [] that I see is when we use starswith and that was a mistake on my part because we should be using loc there. I updated this section, do you think it flows better now?

tidy_lang.iloc[:, 1:]
```

We can also use `iloc[]` to select ranges of rows, using a similar syntax.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like something we would want to do in the .loc[] section above too (assuming we didn't already)

and for [] by itself too somewhere, unless we've already done that elsewhere

```{code-cell} ipython3
region_lang.loc[:, "mother_tongue":"lang_known"].agg(["mean", "std"])
region_lang["mother_tongue":"lang_known"].agg(["mean", "std"])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait, I thought : wasn't allowed in [] as per text above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, this is a typo and should be region_lang.loc[:, "most_at_home":"most_at_work"] as in the next section.

@joelostblom
Copy link
Contributor Author

@trevorcampbell a ping here just because I have started on the first worksheet and tutorial and some of the changes there depend on which decision we take here

@trevorcampbell
Copy link
Contributor

OK, I will look at this today at some point!

@joelostblom joelostblom merged commit 5ce18b6 into main Jul 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Ch3: review usage of loc[]
2 participants