-
Notifications
You must be signed in to change notification settings - Fork 15
Suggestions for Ch3 - Wrangling #39
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
I am working right now on Ch 1, and Ch 1 currently introduces Here is what Ch 1 does now:
But these (specifically for So Chapter 3 needs to fill in the gaps left. It needs to cover:
@joelostblom reasonable? |
Another comment (unrelated to my earlier one, but related to Ch3 editing) See #61 -- make sure to open the R chapter to make sure it aligns. It seems we based off an older version of the textbook. e.g. I discovered an old version of the can lang data in chapter 2 |
Sounds reasonable overall for what to cover in this chapter!
I don't think novices requires knowing much at all about indices so I think we can simplify this to teaching
Yeah, this is an unfortunate difference that is confusing. We could avoid it by always using
Sounds good, both for columns and rows.
I agree it can be nice to mention in the textbook, but not spend time on in class. |
Some notes / comments as I am working through this
|
I agree with this change! But it may be good to include a very brief intro to I think probably the right way to do it is to introduce them in an ad-hoc manner whenever they get used first (like I did in Ch1). If we find later that it's too hard to grok or find defns, we can carve out a special place for those defs.
I think I'm OK leaving it off...in the R version we had it specifically because later in the kmeans section we manipulate dataframes with columns of Kmeans results (which are more general objects). But I think python is much more natural for doing that kind of thing with pandas/sklearn, so probably won't be necessary in the python version.
I wouldn't use the term "1-dimensional" here -- too technical for 1st years. Try to stick to the writing in the R book (where reasonable), which we fine-tuned pretty carefully given experience teaching the class. Also in the series section, I would emphasize that while series technically allow multiple data types, it's usually very unwise to do so and we should really just think of series as having a single type of data.
I'm not sure what My main thought here is that this is a tough direct translation from R. I think your comment---"I actually think things are simpler if we remove the What is a list? section and go straight from a series to a DataFrame."---is right on the money. Pandas seems to be much more abstract in terms of how it represents dataframes under the hood, so I don't think it's worth getting into the nitty gritty of how they're stored. You can just tell students that data frames are an ordered collection of series and move on. Keep as much of the "intuition" and explanation from the R book as makes sense here, of course.
I didn't quite follow this comment -- but in general if there is a minor thing you don't think is super critical to handle right now, just open a separate issue thread for it and clearly document what remains to do there
Agree
Once I got to this point in the python version, the translation from R started getting really rough. In the R book, up until this point we have done simple things. If we want to do more complex analysis, we need multiple operations. Options:
In the python book, this story makes less sense. We have already been chaining operations together in Ch1 and Ch2 (and some fairly complex chaining earlier in Ch3 as well). So it's a bit odd to even have a section about that in the Py version. In fact, I think the right way to handle this is in Chapter 1, right before the visualization section. Please open a separate issue for that! We can probably get rid of the entire section in Ch3 for chaining, and just create a new section for chaining right before "Exploring data with visualizations" in ch1.
Some comments on this aggregation section in general:
As for |
Looking great! I agree with most things you two wrote already, some quick thoughts:
💯
I agree we should not use lang_summary = pd.DataFrame()
lang_summary = lang_summary.assign(min_most_at_home=[min(region_lang["most_at_home"])])
lang_summary = lang_summary.assign(max_most_at_home=[max(region_lang["most_at_home"])])
lang_summary to explain something that should be two short lines in separate cells region_lang["most_at_home"].min() region_lang["most_at_home"].max() I think the most natural place to show region_lang["most_at_home"].agg(['min', 'max'])
I think we should show apply with something that is not easy to do in pandas, whether that is a built-in function that does not exist in pandas or a user-defined function doesn't matter much to me (although I think the latter is a bit more useful for students). If we show apply with min, max, students aren't seeing a good use of that method, just a less preferred alternative pandas method and it might be confusing why we are showing that. |
One more followup: do we teach |
Also: I think it would be wise alongside
@joelostblom mentioned this in his original issue, and I'll add my +1 to this. You can't chain it, but it is a lot shorter (and honestly I use it a lot more often in practice). I'm going to use it in Ch 5+ to simplify things greatly! |
[]
instead ofdf[]
in LOsobject
andcategorical
while removinglist
,tuple
,none
anddict
(or include them in a separate table.melt
more intuitively, use the word "melt" in the explanation, e.g. "We say that we "melt" (or "pivot") the wide table into a longer format".value_vars
when we are usingid_vars
and want all the remaining columns to be melted. This would make the example code in the images significantly easier to read. (multiple places)melt
section is incorrect similarly to one in the previous chapter. Python does not know to keep reading when lines end in commas, it is all about open parens. The comma is needed to separate the parameters, but it could come at any line (I believe this should be corrected in the R book too)info
which we taught previously? (multiple places)str.split
as one thing.iloc
andloc
are not needed here either, we can use ranges for columns with[]
. (throughout this chapter)[]
instead ofdf[]
. I believe it is also not a function technically.assign
before teaching regular assignment to a pandas column. The latter is often more straightforward, even if it can't be chained. I thinkassign
might be better taught as advanced syntax and we would need to mentioned thelambda
caveat when used in chains of a modified dataframe so that students don't easily do a mistake and use the unmodified dataframe to create a new column.query
is added in the chaining section without having been explained before.[]
followed byloc
is discouraged and should be replaced with a singleloc
filtering both rows and columns (that's the main purpose of this indexer). (same under "more than two functions" section)by
keyword unnecessary for groupbyiloc
slicing in agg section is unnecessary, just use[]
apply
is not needed for summary stats on multiple columns, the regular pandas functions work just fine and are much more performant and readable that the current approach with the vanilla python functions.apply
is for custom functions.Some TODOs for myself to look more into to see if these sections can be improved as well:
to_numeric
section. Are we doing method chaining anywhere else?as_dtype
instead ofto_numeric
The text was updated successfully, but these errors were encountered: