Explain filtering of rows and selection of columns in a more Pandas-centric way #34

Closed
joelostblom opened this issue Aug 20, 2022 · 9 comments · Fixed by #48
Labels
bug Something isn't working needs-investigation Further information is requested

Comments

@joelostblom
Contributor

The introduction of [] for row filtering and .loc[] for column filtering seems like a translation of filter and select from the tidyverse, but in Pandas [] can be used for either columns (by names) or rows (by slices or boolean masks), whereas .loc[] is used when both rows and columns are to be selected at the same time to avoid ambiguous chained operations.
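
As a rough illustration of the chained-operation point, here is a minimal sketch. The DataFrame and its column names (`col1`, `col2`) are hypothetical and not taken from the textbook.

```python
import pandas as pd

# Hypothetical example data; the column names are made up for illustration.
df = pd.DataFrame({"col1": [1, 6, 3, 8], "col2": ["a", "b", "c", "d"]})

mask = df["col1"] > 5  # boolean Series, True where col1 exceeds 5

# Chained: first filter rows with [], then select a column with [].
chained = df[mask]["col2"]

# Single step: filter rows and select a column in one .loc[] call.
single = df.loc[mask, "col2"]

# Assigning through one .loc[] call updates df itself, with no
# ambiguity about whether an intermediate copy was modified.
df.loc[mask, "col2"] = "big"
```

For reading values, both forms return the same rows here; the difference matters mostly when assigning, where the single `.loc[]` call is the unambiguous form.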

trevorcampbell added the enhancement (New feature or request) and needs-investigation (Further information is requested) labels Dec 17, 2022
@trevorcampbell
Contributor

Agree, we should probably fix this. It'll need some thought on how to do it right though.

@joelostblom
Contributor Author

joelostblom commented Dec 18, 2022

I agree it is important to say this in the right way and I think it is worth putting time into because it can be confusing for students if we explain this central part in an unclear way. I think we can avoid iloc completely, focus on [] for all tasks that require either one selection or one filter operation, and reserve loc for cases where we need both a filter and selection in the same operation.

@trevorcampbell
Contributor

I like that strategy for [] vs .loc[]

I think I will still explain iloc, because I can imagine students saying "but I just want the first 3 rows! Argh!" in class

It's easy enough to do this (row position vs index)

@trevorcampbell
Contributor

I will try to incorporate this into my current pass because we definitely have to get this right the first time -- I'm actually going to relabel this a bug because it's so off currently

trevorcampbell added the bug (Something isn't working) label and removed the enhancement (New feature or request) label Dec 18, 2022
@joelostblom
Contributor Author

joelostblom commented Dec 18, 2022

> I think I will still explain iloc, because I can imagine students saying "but I just want the first 3 rows! Argh!" in class

A solution to this is df[:3] or df.head(3), which are both position-based. I prefer the former because it is more flexible and briefer. The only use case for iloc that I am aware of is if we want to start selecting columns by position, but I honestly think I have never done that myself.
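
A small sketch of the options mentioned above, assuming a hypothetical DataFrame with string row labels so that positions and labels are clearly distinct:

```python
import pandas as pd

# Hypothetical data; the row labels are strings so they cannot be
# mistaken for row positions.
df = pd.DataFrame(
    {"col1": [10, 20, 30, 40, 50], "col2": list("abcde")},
    index=["r5", "r4", "r3", "r2", "r1"],
)

first_three_slice = df[:3]       # integer slice inside []: by position
first_three_head = df.head(3)    # first 3 rows: by position
first_three_iloc = df.iloc[:3]   # explicit position-based indexer

# The case where iloc has no [] equivalent: a column by position.
first_col_by_position = df.iloc[:, 0]
```

All three row selections return the same three rows in this example.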

@trevorcampbell
Contributor

Hmm...OK, how about this: In chapter 1, we will only teach loc and [] (and only very briefly). Ch1 is just meant to be a very brief / direct intro without dwelling on details anyway, so makes sense.

Ch 3 we should give a deeper introduction to these though, including some of the trickiness, e.g. how [] can use both indices and row numbers depending on its input. I'm still tempted to introduce iloc then, even if we don't put too much effort into it.
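
One example of that trickiness, sketched with a hypothetical DataFrame whose row labels are strings: a slice inside `[]` is interpreted by position or by label depending on what the slice contains.

```python
import pandas as pd

# Hypothetical data with string row labels.
df = pd.DataFrame({"col1": [10, 20, 30, 40]}, index=["w", "x", "y", "z"])

# Integer slice: rows by position, end point excluded.
by_position = df[0:2]

# Label slice: rows by index label, end point included.
by_label = df["w":"y"]
```

Here `by_position` holds rows `w` and `x`, while `by_label` holds rows `w`, `x`, and `y`, because label slices include their end point.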

@joelostblom
Contributor Author

Yeah, I think it makes sense to start with an easier introduction that skips some of the details. Actually, it would make sense to change the current order of the two paragraphs and teach selecting columns before filtering rows because it is more straightforward to just type in the name of a column inside [] than to explain boolean expressions inside []. The third paragraph does require us to mention either [:5] or iloc.

I think this can be a good intro level of explanation:

  1. df[['col1', 'col2']] Select columns by name with lists
  2. df[0:5] Select rows by number with ranges
  3. df[df['col1'] > 5] Filter rows with conditions

The remaining operations are selecting columns by number and rows by name, but I don't think we teach that in the R version of the course either.
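
The three operations listed above could look like this on a small DataFrame. The data and the third column are made up for illustration.

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame(
    {
        "col1": [2, 7, 4, 9, 1, 8],
        "col2": list("abcdef"),
        "col3": [0.5, 1.5, 2.5, 3.5, 4.5, 5.5],
    }
)

# 1. Select columns by name with a list.
selected_columns = df[["col1", "col2"]]

# 2. Select rows by number with a range (end point excluded).
first_five_rows = df[0:5]

# 3. Filter rows with a condition.
filtered_rows = df[df["col1"] > 5]
```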

@trevorcampbell
Contributor

trevorcampbell commented Dec 18, 2022

@joelostblom my plan for Ch1 is:

  • teach [] just for getting a single column as a series (for creating logical series for indexing .loc)
  • teach .loc[] for filtering and selecting
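
A minimal sketch of what that two-step Ch1 pattern might look like, using a hypothetical DataFrame (the actual chapter data and column names are not shown in this thread):

```python
import pandas as pd

# Hypothetical example data.
df = pd.DataFrame(
    {
        "col1": [1, 6, 3, 8],
        "col2": ["a", "b", "c", "d"],
        "col3": [True, False, True, False],
    }
)

# Step 1: [] with a single name returns one column as a Series,
# which is then compared to build a logical (boolean) Series.
col1 = df["col1"]
is_large = col1 > 5

# Step 2: .loc[] filters rows with the logical Series and selects
# columns by name in the same call.
result = df.loc[is_large, ["col1", "col2"]]
```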

Then in chapter 3, we can go into the intricacies of indices, the full generality of [], a bit about iloc, etc.

I mentioned this as well in #39

I think this is the most natural way to get students going without getting bogged down in slicing/indexing/blah blah blah in their first lecture

(one caveat: I do need to check the worksheets and tutorials in week 1+2 to see if it's possible to get away with this)

@joelostblom
Contributor Author

I made some comments on the specifics of this approach directly in #48. I agree with the general idea of getting students going without being bogged down in details.

trevorcampbell linked a pull request Dec 19, 2022 that will close this issue
2 participants