Suggestions for Ch3 - Wrangling #39

Closed
22 of 35 tasks
joelostblom opened this issue Aug 24, 2022 · 8 comments · Fixed by #76

Comments

@joelostblom
Contributor

joelostblom commented Aug 24, 2022

  • [] instead of df[] in LOs
  • Use proper double newline after imports (pandas)
  • "In Python, pandas series are arrays with labels." I don't think we have defined "arrays" before this.
  • "to create the vector region as shown" -> "to create the SERIES region as shown"
  • Fig 18 starts numbering at 1 for the series, but it starts at 0 in Python.
    • There is also no "character" type in Python, strings are read in as objects or strin
  • Table 3 focuses on general python types rather than pandas types. I think the latter makes more sense, so we should include object and categorical while removing list, tuple, none and dict (or include them in a separate table).
    • We don't include all R types, e.g. matrices are excluded, so we don't need to include all Python types.
  • I think it makes more sense to talk about numpy arrays and/or dictionaries instead of "lists" when explaining pandas dataframes. (Named) lists are only appropriate for explaining R dataframes.
  • Fig 19 shows R output with the boolean value in all caps
  • R&Py I think it would be good with an explicit mention that the total size of the data can change when pivoting and specifically how it changes for this figure.
  • Explain melt more intuitively, use the word "melt" in the explanation, e.g. "We say that we "melt" (or "pivot") the wide table into a longer format".
  • We don't need to specify value_vars when we are using id_vars and want all the remaining columns to be melted. This would make the example code in the images significantly easier to read. (multiple places)
  • The note at the end of the melt section is incorrect similarly to one in the previous chapter. Python does not know to keep reading when lines end in commas, it is all about open parens. The comma is needed to separate the parameters, but it could come at any line (I believe this should be corrected in the R book too)
  • Figure 3.9 from the R book is missing
  • Properly format the code in fig25 and fig 26 to have one parameter per row instead of being in one long line.
  • Why use dtypes here instead of info which we taught previously? (multiple places)
  • Explain that .str.split is using an accessor with functions/methods dedicated to a data type. This can be said in a simpler way, but it should be introduced instead of just showing str.split as one thing.
  • iloc and loc are not needed here either, we can use ranges for columns with []. (throughout this chapter)
    • R&Py In general I think it is important to point out that if the order of the columns changes unexpectedly, using a range can lead to unexpected output.
  • Just [] instead of df[]. I believe it is also not a function technically.
  • I'm a bit hesitant about teaching assign before teaching regular assignment to a pandas column. The latter is often more straightforward, even if it can't be chained. I think assign might be better taught as advanced syntax, and we would need to mention the lambda caveat when it is used in chains on a modified dataframe, so that students don't easily make a mistake and use the unmodified dataframe to create a new column.
  • query is added in the chaining section without having been explained before.
  • R&Py I don't understand the note "You might also have noticed that we split the function calls across lines", isn't that the entire point of this section and it was just explained in the previous paragraph?
  • The section about chained indexing with [] followed by loc is discouraged and should be replaced with a single loc filtering both rows and columns (that's the main purpose of this indexer). (same under "more than two functions" section)
  • Unnecessary whitespace in Fig 29
  • by keyword unnecessary for groupby
  • reset_index not explained with agg, multi-index should at least be mentioned with a link to where to read more so that they know why we are resetting.
  • iloc slicing in agg section is unnecessary, just use []
  • apply is not needed for summary stats on multiple columns, the regular pandas functions work just fine and are much more performant and readable than the current approach with the vanilla python functions. apply is for custom functions.
  • TODO I don't think skipna is needed here either
  • The "Apply functions across many columns" is both incorrect and a repetition of the previous section.
  • lambda function used without introduction
  • We don't need apply to apply across one row either, again the built-in functions are preferred with axis='columns' or 1.
  • I think this last section chaining is quite messy overall and could use a complete rewrite of the code. We should make sure that all the functionality from the R book is included here since the languages diverge in their approaches on these topics.
    • apply could be its own section talking about using functions that are not in pandas, but being discouraged for everything that is built into pandas already.
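
As a rough illustration of the melt points in the list above (no value_vars needed when id_vars is given, line continuation coming from the open parenthesis, and the size change when pivoting), here is a minimal sketch. The table and its values are made up for illustration and are not taken from the book.

```python
import pandas as pd

# Hypothetical wide table; names and values are invented for this sketch.
wide = pd.DataFrame({
    "region": ["Toronto", "Vancouver"],
    "most_at_home": [50, 20],
    "most_at_work": [30, 10],
})

# The open parenthesis is what lets the call span several lines.
# The commas only separate the arguments.
# With id_vars given and value_vars omitted, all remaining columns are melted.
long = wide.melt(
    id_vars="region",
    var_name="type",
    value_name="count",
)

# The wide table has 2 rows x 3 columns, the long one 4 rows x 3 columns,
# so the total number of cells changes when we melt.
print(wide.shape, long.shape)
```

Here each of the two value columns contributes one row per region, which is why the row count doubles.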

Some TODOs for myself to look more into to see if these sections can be improved as well:

  • TODO Check the pivoting under fig 29, seems odd to reassign columns like that
  • TODO important lacking functionality from pivot_table
  • TODO simplify code under fig 27 to not use pd.concat (or at least introduce concat before this)? This could also simplify the to_numeric section. Are we doing method chaining anywhere else?
  • TODO - [ ] df.filter + a regex instead of the startswith approach? At least easier for "contains", need to know ^ for startswith
  • TODO astype instead of to_numeric
  • TODO double check NA handling in pandas min max, or why are we using python min max? that is why na is included... pandas would not. The entire assign aggregation section is odd, who assigns an empty df like this? Same for groupby
@trevorcampbell
Contributor

trevorcampbell commented Dec 18, 2022

I am working right now on Ch 1, and Ch 1 currently introduces [] as subsetting rows, and .loc[] as subsetting columns. That's an attempt at direct translation from R, but isn't how one should think about pandas (as mentioned in #34 )

Here is what Ch 1 does now:

  • we introduce [] as a way to subset rows or subset columns
  • we introduce .loc[] as a way to get a subset of both rows and columns

But these (specifically for []) aren't the most comprehensive / general introductions -- Ch1 is just to get students quickly going from point A to point B for a whole data analysis.

So Chapter 3 needs to fill in the gaps left. It needs to cover:

  • indices in dataframes (in Ch1, we don't even mention indices. The meaning (and syntax) of [] and loc[] and iloc[] depends on knowing what an index is)
  • python ranges (excluding endpoint) for [] versus inclusive .loc ranges on indices
  • slicing
  • series
  • difference between loc[] and iloc[] and when one would want to use (@joelostblom asserts that iloc should rarely be used, and I'm tempted to agree, but it would be nice to cover iloc so students know what it is and say explicitly that it is not common to use)
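
To make the range and loc/iloc items above concrete, here is a small sketch of how the three indexers treat the same slice. The data frame is hypothetical and uses the default range index.

```python
import pandas as pd

# Hypothetical data frame with the default index 0, 1, 2, 3.
df = pd.DataFrame(
    {"language": ["a", "b", "c", "d"], "count": [10, 20, 30, 40]}
)

# [] with an integer slice works by position and excludes the endpoint.
by_position = df[0:2]

# .loc works by label and includes the endpoint.
by_label = df.loc[0:2]

# .iloc works by position, like the [] slice above.
by_iloc = df.iloc[0:2]

# Label-based column ranges with .loc also include the endpoint.
all_columns = df.loc[:, "language":"count"]

print(len(by_position), len(by_label), len(by_iloc))
```

With a range index the labels and positions coincide, which is exactly why the inclusive/exclusive difference is easy to miss.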

@joelostblom reasonable?

@trevorcampbell
Contributor

trevorcampbell commented Dec 19, 2022

Another comment (unrelated to my earlier one, but related to Ch3 editing)

See #61 -- make sure to open the R chapter to make sure it aligns. It seems we based off an older version of the textbook.

e.g. I discovered an old version of the can lang data in chapter 2

@joelostblom
Contributor Author

joelostblom commented Dec 19, 2022

Sounds reasonable overall for what to cover in this chapter!

indices in dataframes (in Ch1, we don't even mention indices. The meaning (and syntax) of [] and loc[] and iloc[] depends on knowing what an index is)

I don't think novices require knowing much at all about indices, so I think we can simplify this to teaching reset_index + row filtering whenever they want to filter on an index that is anything other than a range.
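
A minimal sketch of the reset_index + row filtering pattern described here, assuming a hypothetical data frame whose index holds region names:

```python
import pandas as pd

# Hypothetical data frame indexed by region name instead of a range.
df = pd.DataFrame(
    {"count": [10, 20, 30]},
    index=pd.Index(["Toronto", "Vancouver", "Montreal"], name="region"),
)

# reset_index moves the index back into a regular column
# and restores the default range index.
flat = df.reset_index()

# Ordinary row filtering now works on the former index.
toronto = flat[flat["region"] == "Toronto"]
print(toronto)
```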

python ranges (excluding endpoint) for [] versus inclusive .loc ranges on indices

Yeah, this is an unfortunate difference that is confusing. We could avoid it by always using head and tail for slices, but maybe it is better to be explicit here and point out the difference.

slicing

Sounds good, both for columns and rows.

difference between loc[] and iloc[] and when one would want to use (@joelostblom asserts that iloc should rarely be used, and I'm tempted to agree, but it would be nice to cover iloc so students know what it is and say explicitly that it is not common to use)

I agree it can be nice to mention in the textbook, but not spend time on in class.

@lheagy
Contributor

lheagy commented Dec 23, 2022

Some notes / comments as I am working through this

  • I have not updated Figure 14, but left a note in the text of what needs to change
  • I removed the container objects (list, dict, tuple) as we don't have these types of objects as entries in a data frame in this course. I also removed the "Type category" as this isn't in the R version, and I didn't find it helpful (it was repetitive with the description)
    • To me object seems quite abstract to introduce here. If you feel strongly, we can include it in the table, but on my initial pass, I have left it off
  • Input would be appreciated on array vs. list. I agree that the concept of a "labeled / named list" doesn't really exist in python. But I also think it is potentially challenging to introduce array because that makes a connection with numpy arrays, which enforce that the elements are all of the same type. You can create a valid series with a list of different types (although not recommended, there are no safeguards to throw an error if you do pd.Series([1, "a"])). So in my current pass, I have left it as list as this is something we introduce and seems to me to be the easiest shot through the explanation. Happy to revisit if you have other suggestions!
    • I actually think things are simpler if we remove the What is a list? section and go straight from a series to a DataFrame. The
  • I think Fig 19 might now be Fig 15. If so, I think this is minor enough that it can be very low on the priority list. The table is conceptual anyways, so I don't think it is particularly important that we update it
  • I don't think Table 4 comparing list, Series, DataFrame is particularly helpful beyond the explanation beforehand. So I would suggest we remove it (I have commented it out).
  • I replaced the "chaining with [] and loc" with just "Chaining with .loc". I am not sure I am happy with the order here. I am somewhat inclined to move this content to where we talk about loc and selecting multiple rows / columns at once
  • do you have a way you like to explain reset_index() ? it isn't exactly obvious... So ideas on a simple explanation here would be appreciated
  • Following @joelostblom's comments, we need to think about apply and what we think is essential for students there. Introducing this with examples that we would encourage them to do another way (using built-in summary statistics) isn't very satisfying, but we also haven't introduced lambda functions to them yet.
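
For the "Chaining with .loc" item above, a single .loc call that filters rows and columns together might look like the sketch below. The region_lang name follows the thread, but the rows and values are invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for region_lang; the values are made up.
region_lang = pd.DataFrame({
    "region": ["Toronto", "Toronto", "Vancouver"],
    "language": ["English", "French", "English"],
    "most_at_home": [50, 30, 20],
    "most_at_work": [10, 5, 15],
})

# One .loc call: a boolean condition for the rows,
# and a label range (endpoint included) for the columns.
selected = region_lang.loc[
    region_lang["region"] == "Toronto",
    "language":"most_at_home",
]
print(selected)
```

This avoids chaining [] and .loc, which the original issue flags as discouraged.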

@lheagy lheagy mentioned this issue Dec 23, 2022
@trevorcampbell
Contributor

trevorcampbell commented Dec 23, 2022

I removed the container objects (list, dict, tuple) as we don't have these types of objects as entries in a data frame in this course. I also removed the "Type category" as this isn't in the R version, and I didn't find it helpful (it was repetitive with the description)

I agree with this change!

But it may be good to include a very brief intro to list, dict, tuple somewhere in the chapter (book...) -- not because they'll put them in data frames, but because they need them for various arguments in pandas. E.g. in chapter 1, we need to use a dict to specify a map for column renaming.
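
As an example of the kind of ad-hoc introduction meant here, a dict used as a renaming map and a list used to pick columns could be shown roughly like this. The column names are hypothetical.

```python
import pandas as pd

# Hypothetical data frame with terse column names.
df = pd.DataFrame({"Lang": ["English", "French"], "N": [50, 30]})

# A dict maps each old column name (key) to its new name (value).
renamed = df.rename(columns={"Lang": "language", "N": "count"})

# A list names the columns we want to keep, in order.
subset = renamed[["count", "language"]]
print(subset)
```

rename returns a new data frame here, so df itself keeps its original column names.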

I think probably the right way to do it is to introduce them in an ad-hoc manner whenever they get used first (like I did in Ch1). If we find later that it's too hard to grok or find defns, we can carve out a special place for those defs.

To me object seems quite abstract to introduce here. If you feel strongly, we can include it in the table, but on my initial pass, I have left it off

I think I'm OK leaving it off...in the R version we had it specifically because later in the kmeans section we manipulate dataframes with columns of Kmeans results (which are more general objects). But I think python is much more natural for doing that kind of thing with pandas/sklearn, so probably won't be necessary in the python version.

They are ordered and can be indexed. They are strictly 1-dimensional and can contain any data type (integers, strings, floats, etc), including a mix of them

I wouldn't use the term "1-dimensional" here -- too technical for 1st years. Try to stick to the writing in the R book (where reasonable), which we fine-tuned pretty carefully given experience teaching the class.

Also in the series section, I would emphasize that while series technically allow multiple data types, it's usually very unwise to do so and we should really just think of series as having a single type of data.
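
A short sketch of why mixing types is unwise, using the pd.Series([1, "a"]) example from earlier in the thread: pandas accepts it but falls back to the generic object type.

```python
import pandas as pd

# Mixing an integer and a string is allowed, but the series
# is stored with the generic object dtype.
mixed = pd.Series([1, "a"])

# A series holding a single kind of data gets a specific dtype.
numbers = pd.Series([1, 2])

print(mixed.dtype)
print(numbers.dtype)
```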

Input would be appreciated on array vs. list.

I'm not sure what array means here -- python/pandas doesn't have an array type, does it? I don't think the connection to numpy is useful either, since students probably won't have encountered numpy yet (and may not even in the whole class)

My main thought here is that this is a tough direct translation from R. I think your comment---"I actually think things are simpler if we remove the What is a list? section and go straight from a series to a DataFrame."---is right on the money. Pandas seems to be much more abstract in terms of how it represents dataframes under the hood, so I don't think it's worth getting into the nitty gritty of how they're stored. You can just tell students that data frames are an ordered collection of series and move on.
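
If the "ordered collection of series" framing is used, a sketch along these lines may be enough, assuming two small made-up series:

```python
import pandas as pd

# Two hypothetical series of the same length.
region = pd.Series(["Toronto", "Vancouver"])
count = pd.Series([50, 20])

# A data frame built from the series, one per column.
df = pd.DataFrame({"region": region, "count": count})

# Selecting a single column gives back a series.
column = df["count"]
print(type(column))
```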

Keep as much of the "intuition" and explanation from the R book as makes sense here, of course.

I think Fig 19 might now be Fig 15. If so, I think this is minor enough that it can be very low on the priority list. The table is conceptual anyways, so I don't think it is particularly important that we update it

I didn't quite follow this comment -- but in general if there is a minor thing you don't think is super critical to handle right now, just open a separate issue thread for it and clearly document what remains to do there

I don't think Table 4 comparing list, Series, DataFrame is particularly helpful beyond the explanation beforehand. So I would suggest we remove it (I have commented it out).

Agree

I replaced the "chaining with [] and loc" with just "Chaining with .loc". I am not sure I am happy with the order here. I am somewhat inclined to move this content to where we talk about loc and selecting multiple rows / columns at once

Once I got to this point in the python version, the translation from R started getting really rough.

In the R book, up until this point we have done simple things. If we want to do more complex analysis, we need multiple operations. Options:

  • composition
  • temporary vars
  • pipes <--- this one is good!

In the python book, this story makes less sense. We have already been chaining operations together in Ch1 and Ch2 (and some fairly complex chaining earlier in Ch3 as well). So it's a bit odd to even have a section about that in the Py version.

In fact, I think the right way to handle this is in Chapter 1, right before the visualization section.

Please open a separate issue for that! We can probably get rid of the entire section in Ch3 for chaining, and just create a new section for chaining right before "Exploring data with visualizations" in ch1.

we need to think about apply and what we think is essential for students there. Introducing this with examples that we would encourage them to do another way (using built-in summary statistics) isn't very satisfying, but we also haven't introduced lambda functions to them yet.

Some comments on this aggregation section in general:

  • why are we using assign for aggregation? seems very unnatural. I don't think I'd do this in practice...
    • perhaps it's because Pandas doesn't output a dataframe for summary stats, while R did. So the MDS students tried to replicate that. I think it would be fine to teach it without forcing the data frame at first. Then later in this chapter, we can say "well if you want to manipulate the summary statistics too, then you should put them back into a dataframe using assign"
  • I think built-ins are totally fine to use! In fact, it's the more natural analogue from the R version to use those built-ins anyway. It's also what I would use when I'm using pandas in the real world. The only constraint is that they should work well with groupby, but it looks like they do.
  • maybe include a table of commonly-used built-in summary funcs!
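
A minimal sketch of the built-ins discussed above, applied across columns, across rows, and together with groupby. The region_lang name follows the thread, but the data is invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for region_lang; the values are made up.
region_lang = pd.DataFrame({
    "region": ["Toronto", "Toronto", "Vancouver"],
    "most_at_home": [50, 30, 20],
    "most_at_work": [10, 5, 15],
})
value_columns = ["most_at_home", "most_at_work"]

# Built-in summary per column; the result is a series, not a data frame.
overall_max = region_lang[value_columns].max()

# The same built-in computed within each row.
row_max = region_lang[value_columns].max(axis="columns")

# The built-in also works per group; reset_index turns the
# group labels back into a regular column.
per_region = (
    region_lang
    .groupby("region")[value_columns]
    .max()
    .reset_index()
)
print(per_region)
```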

As for apply itself: I do think we should still teach it, more or less in the same way as what's there currently. We don't need to be explicit about what a lambda function is. You'll see in our examples we use max as the function input, which yes is technically a function object, but students don't need to know that to use it.

@joelostblom
Contributor Author

Looking great! I agree with most things you two wrote already, some quick thoughts:

My main thought here is that this is a tough direct translation from R. I think your comment---"I actually think things are simpler if we remove the What is a list? section and go straight from a series to a DataFrame."---is right on the money.

💯

why are we using assign for aggregation? seems very unnatural. I don't think I'd do this in practice... perhaps it's because Pandas doesn't output a dataframe for summary stats, while R did.

I agree we should not use assign here and that we don't need to care about getting these values as dataframes. I think we should also remove variable assignment. We currently have a fair amount of noise in the code here that I think can confuse students. For example, we use all these long cryptic lines:

lang_summary = pd.DataFrame()
lang_summary = lang_summary.assign(min_most_at_home=[min(region_lang["most_at_home"])])
lang_summary = lang_summary.assign(max_most_at_home=[max(region_lang["most_at_home"])])
lang_summary

to explain something that should be two short lines in separate cells

region_lang["most_at_home"].min()
region_lang["most_at_home"].max()

I think the most natural place to show agg is also here and add a cell with:

region_lang["most_at_home"].agg(['min', 'max'])

I think built-ins are totally fine to use! In fact, it's the more natural analogue from the R version to use those built-ins anyway. It's also what I would use when I'm using pandas in the real world

I think we should show apply with something that is not easy to do in pandas, whether that is a built-in function that does not exist in pandas or a user-defined function doesn't matter much to me (although I think the latter is a bit more useful for students). If we show apply with min, max, students aren't seeing a good use of that method, just a less preferred alternative pandas method and it might be confusing why we are showing that.
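
One possible shape for such an example is sketched below, with a user-defined function passed to apply. The value_range function and the data are hypothetical and only meant to illustrate the idea, not to suggest what the book should use.

```python
import pandas as pd

# Hypothetical stand-in for region_lang; the values are made up.
region_lang = pd.DataFrame({
    "most_at_home": [50, 30, 20],
    "most_at_work": [10, 5, 15],
})


def value_range(column):
    """Difference between the largest and smallest value in a column."""
    return column.max() - column.min()


# apply calls value_range once per column and collects the results.
spread = region_lang.apply(value_range)
print(spread)
```

The function is passed by name without parentheses, so students can use it without first learning what a lambda is.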

@trevorcampbell
Contributor

One more followup: do we teach replace here?

@trevorcampbell
Contributor

Also: I think it would be wise alongside assign to just teach basic column assignment, e.g.

df["new_col"] = df["old_col"] * 3

@joelostblom mentioned this in his original issue, and I'll add my +1 to this. You can't chain it, but it is a lot shorter (and honestly I use it a lot more often in practice). I'm going to use it in Ch 5+ to simplify things greatly!
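
For comparison, here is a sketch of both styles side by side, including the lambda caveat from the original issue. The data frame and the plus_one column are made up for illustration.

```python
import pandas as pd

# Hypothetical data frame.
df = pd.DataFrame({"old_col": [1, 2, 3]})

# assign in a chain: the lambda receives the data frame as it exists
# at that step, so the second assign can see new_col. Writing
# df["new_col"] there instead would fail, because df itself
# does not have that column at this point.
chained = (
    df
    .assign(new_col=lambda d: d["old_col"] * 3)
    .assign(plus_one=lambda d: d["new_col"] + 1)
)

# Basic column assignment: shorter, but it modifies df in place
# and cannot be chained.
df["new_col"] = df["old_col"] * 3
print(chained)
```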
