Suggestions for Ch3 - Wrangling #39

joelostblom · 2022-08-24T19:57:07Z

Some TODOs for myself to look more into to see if these sections can be improved as well:

TODO Check the pivoting under fig 29, seems odd to reassign columns like that
TODO important lacking functionality from pivot_table
TODO simplify code under fig 27 to not use pd.concat (or at least introduce concat before this)? This could also simplify the to_numeric section. Are we doing method chaining anywhere else?
TODO - [ ] df.filter + a regex instead of the startswith approach? At least easier for "contains", need to know ^ for startswith
TODO astype instead of to_numeric
TODO double check NA handling in pandas min max, or why are we using python min max? That is why NA is included... pandas would not. The entire assign aggregation section is odd, who assigns an empty df like this? Same for groupby
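
Two of the TODOs above (column selection by pattern, and NA handling in min/max) can be illustrated with a small sketch. The data frame and its values are made up for illustration and are not taken from the chapter.

```python
import numpy as np
import pandas as pd

# Hypothetical data; names and values are illustrative only.
df = pd.DataFrame({
    "mother_tongue": [np.nan, 30.0, 10.0],
    "most_at_home": [5.0, 25.0, 8.0],
    "region": ["A", "B", "C"],
})

# Selecting columns by name pattern: startswith needs a boolean mask over
# df.columns, while filter takes a regex ("^" anchors the match at the start).
by_startswith = df.loc[:, df.columns.str.startswith("mo")]
by_regex = df.filter(regex="^mo")

# NA handling: Python's built-in min compares elements one by one, so a NaN
# in the first position ends up as the result, whereas the pandas method
# skips missing values by default (skipna=True).
python_min = min(df["mother_tongue"])   # nan
pandas_min = df["mother_tongue"].min()  # 10.0
```

Note that the result of the built-in min depends on where the NaN sits in the column, which is one more reason to prefer the pandas methods.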

trevorcampbell · 2022-12-18T19:35:51Z

I am working right now on Ch 1, and Ch 1 currently introduces [] as subsetting rows, and .loc[] as subsetting columns. That's an attempt at direct translation from R, but isn't how one should think about pandas (as mentioned in #34 )

Here is what Ch 1 does now:

we introduce [] as a way to subset rows or subset columns
we introduce .loc[] as a way to get a subset of both rows and columns

But these (specifically for []) aren't the most comprehensive / general introductions -- Ch1 is just to get students quickly going from point A to point B for a whole data analysis.

So Chapter 3 needs to fill in the gaps left. It needs to cover:

indices in dataframes (in Ch1, we don't even mention indices. The meaning (and syntax) of [] and loc[] and iloc[] depends on knowing what an index is)
python ranges (excluding endpoint) for [] versus inclusive .loc ranges on indices
slicing
series
difference between loc[] and iloc[] and when one would want to use (@joelostblom asserts that iloc should rarely be used, and I'm tempted to agree, but it would be nice to cover iloc so students know what it is and say explicitly that it is not common to use)
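
As a rough sketch of the slicing differences listed above (the data frame is hypothetical and uses the default 0, 1, 2, ... index):

```python
import pandas as pd

# Hypothetical data with the default integer index.
df = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "value": [10, 20, 30, 40],
})

rows_bracket = df[0:2]     # Python-style slice: endpoint excluded -> rows 0, 1
rows_loc = df.loc[0:2]     # label-based slice: endpoint included -> rows 0, 1, 2
rows_iloc = df.iloc[0:2]   # position-based slice: endpoint excluded -> rows 0, 1

one_column = df["name"]                  # a single column comes back as a Series
both = df.loc[0:2, "name":"value"]       # rows and columns together, both inclusive
```

With the default index, labels and positions coincide, so the only visible difference between .loc and .iloc here is whether the endpoint is included.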

@joelostblom reasonable?

trevorcampbell · 2022-12-19T00:58:43Z

Another comment (unrelated to my earlier one, but related to Ch3 editing)

See #61 -- make sure to open the R chapter to make sure it aligns. It seems we based this on an older version of the textbook.

e.g. I discovered an old version of the can lang data in chapter 2

joelostblom · 2022-12-19T02:34:33Z

Sounds reasonable overall for what to cover in this chapter!

indices in dataframes (in Ch1, we don't even mention indices. The meaning (and syntax) of [] and loc[] and iloc[] depends on knowing what an index is)

I don't think novices need to know much at all about indices, so I think we can simplify this to teaching reset_index + row filtering whenever they want to filter on an index that is anything other than a range.
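
A minimal sketch of the reset_index + row filtering approach, using a made-up data frame whose index holds labels:

```python
import pandas as pd

# Hypothetical data frame whose index holds labels rather than 0, 1, 2, ...
df = pd.DataFrame(
    {"value": [10, 20, 30]},
    index=pd.Index(["a", "b", "c"], name="name"),
)

# reset_index moves the index into an ordinary column and restores the
# default 0, 1, 2, ... index, so the usual row filtering applies.
flat = df.reset_index()
filtered = flat[flat["name"] != "b"]
```

This keeps the explanation to one idea: the index becomes a regular column, and everything students already know about filtering columns carries over.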

python ranges (excluding endpoint) for [] versus inclusive .loc ranges on indices

Yeah, this is an unfortunate difference that is confusing. We could avoid it by always using head and tail for slices, but maybe it is better to be explicit here and point out the difference.

slicing

Sounds good, both for columns and rows.

difference between loc[] and iloc[] and when one would want to use (@joelostblom asserts that iloc should rarely be used, and I'm tempted to agree, but it would be nice to cover iloc so students know what it is and say explicitly that it is not common to use)

I agree it can be nice to mention in the textbook, but not spend time on in class.

lheagy · 2022-12-23T01:46:20Z

Some notes / comments as I am working through this

I have not updated Figure 14, but left a note in the text of what needs to change
I removed the container objects (list, dict, tuple) as we don't have these types of objects as entries in a data frame in this course. I also removed the "Type category" as this isn't in the R version, and I didn't find it helpful (it was repetitive with the description)
- To me object seems quite abstract to introduce here. If you feel strongly, we can include it in the table, but on my initial pass, I have left it off
Input would be appreciated on array vs. list. I agree that the concept of a "labeled / named list" doesn't really exist in python. But I also think it is potentially challenging to introduce array, because that makes a connection with numpy arrays, which enforce that the elements are all of the same type. You can create a valid series with a list of different types (although this is not recommended, there are no safeguards that throw an error if you do pd.Series([1, "a"])). So in my current pass, I have left it as list, as this is something we introduce and it seems to me to be the easiest path through the explanation. Happy to revisit if you have other suggestions!
- I actually think things are simpler if we remove the What is a list? section and go straight from a series to a DataFrame. The
I think Fig 19 might now be Fig 15. If so, I think this is minor enough that it can be very low on the priority list. The table is conceptual anyways, so I don't think it is particularly important that we update it
I don't think Table 4 comparing list, Series, DataFrame is particularly helpful beyond the explanation beforehand. So I would suggest we remove it (I have commented it out).
I replaced the "chaining with [] and loc" with just "Chaining with .loc". I am not sure I am happy with the order here. I am somewhat inclined to move this content to where we talk about loc and selecting multiple rows / columns at once
do you have a way you like to explain reset_index() ? it isn't exactly obvious... So ideas on a simple explanation here would be appreciated
Following @joelostblom's comments, we need to think about apply and what we think is essential for students there. Introducing this with examples that we would encourage them to do another way (using built-in summary statistics) isn't very satisfying, but we also haven't introduced lambda functions to them yet.

trevorcampbell · 2022-12-23T18:57:56Z

I removed the container objects (list, dict, tuple) as we don't have these types of objects as entries in a data frame in this course. I also removed the "Type category" as this isn't in the R version, and I didn't find it helpful (it was repetitive with the description)

I agree with this change!

But it may be good to include a very brief intro to list, dict, tuple somewhere in the chapter (book...) -- not because they'll put them in data frames, but because they need them for various arguments in pandas. E.g. in chapter 1, we need to use a dict to specify a map for column renaming.
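
For instance, the column renaming case might look like the following sketch (the column names are hypothetical stand-ins, not necessarily the ones used in chapter 1):

```python
import pandas as pd

# Hypothetical column names, for illustration only.
df = pd.DataFrame({"Mother tongue": [10, 20], "Most at home": [5, 15]})

# A dict maps each old column name (key) to its new name (value).
renamed = df.rename(columns={
    "Mother tongue": "mother_tongue",
    "Most at home": "most_at_home",
})
```

Introducing the dict at the point of use like this only requires explaining the "old name: new name" pairing, not dicts in general.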

I think probably the right way to do it is to introduce them in an ad-hoc manner whenever they get used first (like I did in Ch1). If we find later that it's too hard to grok or find defns, we can carve out a special place for those defs.

To me object seems quite abstract to introduce here. If you feel strongly, we can include it in the table, but on my initial pass, I have left it off

I think I'm OK leaving it off...in the R version we had it specifically because later in the kmeans section we manipulate dataframes with columns of Kmeans results (which are more general objects). But I think python is much more natural for doing that kind of thing with pandas/sklearn, so probably won't be necessary in the python version.

They are ordered and can be indexed. They are strictly 1-dimensional and can contain any data type (integers, strings, floats, etc), including a mix of them

I wouldn't use the term "1-dimensional" here -- too technical for 1st years. Try to stick to the writing in the R book (where reasonable), which we fine-tuned pretty carefully given experience teaching the class.

Also in the series section, I would emphasize that while series technically allow multiple data types, it's usually very unwise to do so and we should really just think of series as having a single type of data.
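
A small sketch of what happens with mixed types, which may help motivate the "single type" framing:

```python
import pandas as pd

mixed = pd.Series([1, "a"])      # allowed, but the dtype falls back to object
single = pd.Series([1, 2, 3])    # one type of data -> an integer dtype

print(mixed.dtype)    # object
print(single.dtype)
```

No error is raised for the mixed series; the only signal is the generic object dtype.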

Input would be appreciated on array vs. list.

I'm not sure what array means here -- python/pandas doesn't have an array type, does it? I don't think the connection to numpy is useful either, since students probably won't have encountered numpy yet (and may not even in the whole class)

My main thought here is that this is a tough direct translation from R. I think your comment---"I actually think things are simpler if we remove the What is a list? section and go straight from a series to a DataFrame."---is right on the money. Pandas seems to be much more abstract in terms of how it represents dataframes under the hood, so I don't think it's worth getting into the nitty gritty of how they're stored. You can just tell students that data frames are an ordered collection of series and move on.
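
The "ordered collection of series" description could be backed by a short sketch along these lines (names and values are made up):

```python
import pandas as pd

# Two hypothetical series that will become the columns.
names = pd.Series(["a", "b", "c"])
values = pd.Series([1.5, 2.5, 3.5])

# Each column of the data frame is one series; column order is kept.
df = pd.DataFrame({"name": names, "value": values})

column = df["value"]   # pulling a column back out gives a Series again
```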

Keep as much of the "intuition" and explanation from the R book as makes sense here, of course.

I think Fig 19 might now be Fig 15. If so, I think this is minor enough that it can be very low on the priority list. The table is conceptual anyways, so I don't think it is particularly important that we update it

I didn't quite follow this comment -- but in general if there is a minor thing you don't think is super critical to handle right now, just open a separate issue thread for it and clearly document what remains to do there

I don't think Table 4 comparing list, Series, DataFrame is particularly helpful beyond the explanation beforehand. So I would suggest we remove it (I have commented it out).

Agree

I replaced the "chaining with [] and loc" with just "Chaining with .loc". I am not sure I am happy with the order here. I am somewhat inclined to move this content to where we talk about loc and selecting multiple rows / columns at once

Once I got to this point in the python version, the translation from R started getting really rough.

In the R book, up until this point we have done simple things. If we want to do more complex analysis, we need multiple operations. Options:

composition
temporary vars
pipes <--- this one is good!

In the python book, this story makes less sense. We have already been chaining operations together in Ch1 and Ch2 (and some fairly complex chaining earlier in Ch3 as well). So it's a bit odd to even have a section about that in the Py version.

In fact, I think the right way to handle this is in Chapter 1, right before the visualization section.

Please open a separate issue for that! We can probably get rid of the entire section in Ch3 for chaining, and just create a new section for chaining right before "Exploring data with visualizations" in ch1.
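
For reference, a chaining section in the Python version might contrast temporary variables with method chaining roughly as follows. This is only a sketch on invented data, not the book's example.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({
    "region": ["A", "A", "B", "B"],
    "most_at_home": [5, 25, 8, 40],
})

# Temporary variables: one intermediate name per step.
big = df[df["most_at_home"] > 6]
big_sorted = big.sort_values("most_at_home", ascending=False)
top_temp = big_sorted.head(2)

# Chaining: the same steps, each method called on the previous result.
top_chained = (
    df[df["most_at_home"] > 6]
    .sort_values("most_at_home", ascending=False)
    .head(2)
)
```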

we need to think about apply and what we think is essential for students there. Introducing this with examples that we would encourage them to do another way (using built-in summary statistics) isn't very satisfying, but we also haven't introduced lambda functions to them yet.

Some comments on this aggregation section in general:

why are we using assign for aggregation? seems very unnatural. I don't think I'd do this in practice...
- perhaps it's because Pandas doesn't output a dataframe for summary stats, while R did. So the MDS students tried to replicate that. I think it would be fine to teach it without forcing the data frame at first. Then later in this chapter, we can say "well if you want to manipulate the summary statistics too, then you should put them back into a dataframe using assign"
I think built-ins are totally fine to use! In fact, it's the more natural analogue from the R version to use those built-ins anyway. It's also what I would use when I'm using pandas in the real world. The only constraint is that they should work well with groupby, but it looks like they do.
maybe include a table of commonly-used built-in summary funcs!
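
A sketch of the built-ins working with and without groupby; the region_lang here is a toy stand-in with invented values, not the real data set:

```python
import pandas as pd

# Toy stand-in for region_lang; values are invented.
region_lang = pd.DataFrame({
    "region": ["A", "A", "B", "B"],
    "most_at_home": [5, 25, 8, 40],
})

# Built-in summary methods on a single column return plain values.
overall_max = region_lang["most_at_home"].max()
overall_mean = region_lang["most_at_home"].mean()

# The same methods work after groupby, giving one value per group.
max_by_region = region_lang.groupby("region")["most_at_home"].max()
```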

As for apply itself: I do think we should still teach it, more or less in the same way as what's there currently. We don't need to be explicit about what a lambda function is. You'll see in our examples we use max as the function input, which yes is technically a function object, but students don't need to know that to use it.

joelostblom · 2022-12-24T00:53:06Z

Looking great! I agree with most things you two wrote already, some quick thoughts:

My main thought here is that this is a tough direct translation from R. I think your comment---"I actually think things are simpler if we remove the What is a list? section and go straight from a series to a DataFrame."---is right on the money.

💯

why are we using assign for aggregation? seems very unnatural. I don't think I'd do this in practice... perhaps it's because Pandas doesn't output a dataframe for summary stats, while R did.

I agree we should not use assign here and that we don't need to care about getting these values as dataframes. I think we should also remove variable assignment. We currently have a fair amount of noise in the code here that I think can confuse students. For example, we use all these long cryptic lines:

lang_summary = pd.DataFrame()
lang_summary = lang_summary.assign(min_most_at_home=[min(region_lang["most_at_home"])])
lang_summary = lang_summary.assign(max_most_at_home=[max(region_lang["most_at_home"])])
lang_summary

to explain something that should be two short lines in separate cells

region_lang["most_at_home"].min()

region_lang["most_at_home"].max()

I think the most natural place to show agg is also here and add a cell with:

region_lang["most_at_home"].agg(['min', 'max'])
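
To see what agg returns, here is a runnable sketch on a toy stand-in for region_lang (the second column, most_at_work, and all values are assumptions made for the example):

```python
import pandas as pd

# Toy stand-in for region_lang; column values are invented.
region_lang = pd.DataFrame({
    "most_at_home": [5, 25, 8, 40],
    "most_at_work": [2, 30, 4, 35],
})

# On one column, agg with a list of names returns a Series labelled by function.
col_summary = region_lang["most_at_home"].agg(["min", "max"])

# On several columns, the result is already a data frame
# (rows: functions, columns: original columns), with no assign needed.
frame_summary = region_lang[["most_at_home", "most_at_work"]].agg(["min", "max"])
```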

I think built-ins are totally fine to use! In fact, it's the more natural analogue from the R version to use those built-ins anyway. It's also what I would use when I'm using pandas in the real world

I think we should show apply with something that is not easy to do in pandas, whether that is a built-in function that does not exist in pandas or a user-defined function doesn't matter much to me (although I think the latter is a bit more useful for students). If we show apply with min, max, students aren't seeing a good use of that method, just a less preferred alternative pandas method and it might be confusing why we are showing that.
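
One possible shape for such an example is a user-defined function passed to apply. The function value_range and the data below are hypothetical, chosen only to show the pattern.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({
    "most_at_home": [5, 25, 8, 40],
    "most_at_work": [2, 30, 4, 35],
})

def value_range(column):
    """Difference between the largest and smallest value in a column."""
    return column.max() - column.min()

# apply calls value_range once per column and collects the results in a Series.
ranges = df.apply(value_range)
```

Students can use this without knowing what a lambda or a function object is: they only pass the function's name.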

trevorcampbell · 2022-12-31T23:08:49Z

One more followup: do we teach replace here?

trevorcampbell · 2023-01-01T06:24:26Z

Also: I think it would be wise alongside assign to just teach basic column assignment, e.g.

df["new_col"] = df["old_col"] * 3

@joelostblom mentioned this in his original issue, and I'll add my +1 to this. You can't chain it, but it is a lot shorter (and honestly I use it a lot more often in practice). I'm going to use it in Ch 5+ to simplify things greatly!
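
A side-by-side sketch of the two styles, on a made-up data frame:

```python
import pandas as pd

df = pd.DataFrame({"old_col": [1, 2, 3]})

# assign returns a new data frame, so it can sit in the middle of a chain.
chained = df.assign(new_col=df["old_col"] * 3).sort_values("new_col", ascending=False)

# Direct column assignment modifies df in place; shorter, but not chainable.
df["new_col"] = df["old_col"] * 3
```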

trevorcampbell mentioned this issue Dec 18, 2022

Explain filtering of rows and selection of columns in a more Pandas-centric way #34

Closed

lheagy mentioned this issue Dec 23, 2022

Ch3: wrangling #76

Merged

trevorcampbell closed this as completed in #76 Jan 4, 2023
