DOC: pandas cheat sheet #13202

jreback · 2016-05-17T13:44:35Z

https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf

I think we could pretty much rip this off exactly as is and just substitute pandas functions directly.

Further, we could update the comparison with R a bit.

anyone up for this?


jorisvandenbossche · 2016-05-17T14:15:27Z

+1 I have wanted to do this already for a long time! (but not sure I will be able to :-))

sinhrks · 2016-05-17T22:48:00Z

+1. shall we close #1618?

MrMauricioLeite · 2016-06-29T20:32:43Z

+1

Dr-Irv · 2016-12-02T21:01:04Z

I don't know R, but know that having a cheatsheet like this would be helpful. I'm going to give it a shot in Powerpoint, because I know it well and can make it pretty. I'm not sure what other tool to use that would let other people edit it and provide the good formatting.

jreback · 2016-12-02T21:10:35Z

@Dr-Irv we will just check in the PDF along with the source, whatever format it's in

Dr-Irv · 2016-12-02T22:38:58Z

Here is what I did today. Plagiarizing is blatant! Will be flying a lot next week, so this will be good to do on the plane.
Pandas Cheat Sheet.pdf

TomAugspurger · 2016-12-03T14:57:40Z

Nice start! That original from RStudio is CC 4.0, so AFAIK it's fine to copy / transform it as long as you acknowledge RStudio for the original.

Dr-Irv · 2016-12-09T22:19:06Z

Here is the first page of the cheat sheet. I'll be working on page 2 next. In most places, I found the pandas way of doing things. I used @jreback's initial comment as a guideline, i.e., "I think we could pretty much rip this off exactly as is and just substitute pandas functions directly." There were places where the R example didn't make sense, so I made my own arbitrary choices.

Comments and criticism are welcome. I think it is better to get feedback here rather than anywhere else.
Pandas Cheat Sheet.pdf

jreback · 2016-12-10T15:25:53Z

for the subset section where you have:

df[['width','length','species']]
  Select multiple columns with specific names.
df['width'] or df.width
  Select single column with specific name.
df[[i for i in df.columns if '.' in i]]
  Select columns whose name contains a character string.
df[[i for i in df.columns if i.endswith("Length")]]
  Select columns whose name ends with a character string.
df[[i for i in df.columns if re.match('.t.',i)]]
  Select columns whose name matches a regular expression.
df[["x"+str(i) for i in range(1,6)]]
  Select columns named x1, x2, x3, x4, x5.
df[[i for i in df.columns if i.startswith("Length")]]
  Select columns whose name starts with a character string.
df.loc[:,"x2":"x4"]
  Select all columns between x2 and x4 (inclusive).
df[[i for i in df.columns if i != "Species"]]
  Select all columns except Species.
df.iloc[:,[1,2,5]]
  Select columns in positions 1, 2 and 5 (first column is 0).

more idiomatic to do:

df[['width','length','species']]
  Select multiple columns with specific names.
df['width'] or df.width
  Select single column with specific name.
df.filter(regex='\.')
  Select columns whose name contains a character string.
df.filter(regex='Length$')
  Select columns whose name ends with a character string.
df.filter(regex='.t.')
  Select columns whose name matches a regular expression.
df.filter(regex='^x[1-5]$')
  Select columns named x1, x2, x3, x4, x5.
df.filter(regex='^Length')
  Select columns whose name starts with a character string.
df.loc[:,"x2":"x4"]
  Select all columns between x2 and x4 (inclusive).
df.filter(regex='^(?!Species$).*')
  Select all columns except Species.
df.iloc[:,[1,2,5]]
  Select columns in positions 1, 2 and 5 (first column is 0).

though maybe eliminate some of these column selections and show usage of .loc instead
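To make the idiomatic selections above concrete, here is a minimal sketch. The frame and its column names are hypothetical and chosen only to exercise each selector. Note that `DataFrame.filter(regex=...)` matches with `re.search`, so patterns need explicit `^` and `$` anchors when a full-name match is intended.

```python
import pandas as pd

# Hypothetical one-row frame; the column names are illustrative only.
df = pd.DataFrame(
    [[1, 2, 3, 4, 5, 6]],
    columns=["Sepal.Length", "Sepal.Width", "Species", "x1", "x2", "x3"],
)

# Columns whose name ends with "Length".
ends_with_length = df.filter(regex="Length$")

# Columns whose name contains a literal dot.
contains_dot = df.filter(regex=r"\.")

# Columns named exactly x1, x2 or x3 (anchored at both ends).
numbered = df.filter(regex=r"^x[1-3]$")

# Every column except the one named exactly "Species".
not_species = df.filter(regex=r"^(?!Species$).*")

# Label-based slice with .loc; both endpoints are included.
label_slice = df.loc[:, "x1":"x3"]
```

The `.loc` slice is often the simpler choice when the wanted columns are contiguous, since it needs no regular expression at all.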

Dr-Irv · 2016-12-13T20:29:49Z

@jreback Thanks for the suggestions. I've made changes in my current working copy.

Separate question - R has a cume_dist() method that computes the cumulative distribution of a vector. Similar to rank(pct=True), but different. Is there a pandas equivalent?
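One possible answer, offered as a sketch rather than a confirmed equivalent: dplyr documents `cume_dist()` as the proportion of values less than or equal to the current one. Under that assumption, ranking with `method="max"` together with `pct=True` should give the same numbers, while the default `method="average"` differs whenever there are ties. The sample values are made up.

```python
import pandas as pd

s = pd.Series([10, 20, 20, 30])

# Default ranking averages the ranks of tied values before dividing by the count.
pct_rank = s.rank(pct=True)

# Using the highest rank among ties gives the share of values <= each value,
# which is the assumed definition of R's cume_dist().
cume_dist = s.rank(method="max", pct=True)
```

With the tied value 20, `pct_rank` gives 0.625 while `cume_dist` gives 0.75, which is the difference the question refers to.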

Dr-Irv · 2016-12-13T23:03:53Z

Here's my latest version. Only thing left to do is to do groupby() examples in the open space on the second page. Some notes: (1) Don't have correspondence to R's cume_dist(), so left it out, and added clip() instead. (2) Setdiff between DataFrames is not there (discussion in #4480) so the example at bottom right of page 2 isn't so pretty. Maybe someone knows a better way.
Pandas Cheat Sheet.pdf
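For the set-difference case, one workable pattern is an outer merge with `indicator=True`, keeping only the rows marked `left_only`. This is a sketch and not necessarily what the PDF uses; the contents of `ydf` and `zdf` here are assumed for illustration.

```python
import pandas as pd

# Assumed example frames; the real cheat sheet data may differ.
ydf = pd.DataFrame({"x1": ["A", "B", "C"], "x2": [1, 2, 3]})
zdf = pd.DataFrame({"x1": ["B", "C", "D"], "x2": [2, 3, 4]})

# Merge on all shared columns; the "_merge" column records each row's origin.
merged = pd.merge(ydf, zdf, how="outer", indicator=True)

# Rows that appear in ydf but not in zdf.
only_in_ydf = (
    merged[merged["_merge"] == "left_only"]
    .drop(columns="_merge")
    .reset_index(drop=True)
)
```

This treats rows as equal only when every shared column matches, and duplicate rows in either frame are multiplied by the merge, so it is not a strict set operation.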

jreback · 2016-12-13T23:42:54Z

@Dr-Irv looking good!

I would remove use of .between (this is going to be removed, and is trivially replaced by .loc)

I would add use of .rolling() & .resample() (you said you're adding groupby). The .expanding().apply(any) example is not very typical; I would show something like .expanding().sum() (or .agg(['sum', 'count']), or whatever).

maybe use .plot somewhere?

of course it's ONLY 2 pages! hahah

sinhrks · 2016-12-14T07:21:48Z

Looks really good. Few points:

Reshaping Data
- .assign example looks incorrect. It duplicates the Make New Variables section.
- Personally, I feel .rename is used less than setting columns directly.
- Add .reindex?
Subset Rows
- Add .dropna?
Make New Variable
- .drop should be moved to Subset Variables?

Dr-Irv · 2016-12-14T22:47:52Z

@jreback regarding your last comments. I've removed between(). I had it in there because it was in the R cheat sheet. I've taken out the two expanding().apply() examples that computed cumulative all and cumulative any. Those examples were in there because they were in the R version. In terms of adding expanding() examples, expanding().sum() and cumsum() are the same thing, so I added expanding().median() instead.

My plan was to use remaining space on second page for groupby(), and after that is done, I'll look for space for .rolling() and .resample(). As for .plot(), maybe that's a third page........
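The point about `expanding().sum()` and `cumsum()` can be checked with a short sketch, which also shows the `expanding().median()`, `rolling()` and `.agg([...])` forms mentioned in this exchange. The sample values are arbitrary.

```python
import pandas as pd

s = pd.Series([3.0, 1.0, 4.0, 1.0, 5.0])

# An expanding window grows from the first element up to the current one,
# so its sum is the same running total that cumsum() produces.
running_total = s.expanding().sum()
same_as_cumsum = running_total.equals(s.cumsum())

# There is no cumulative-median method, so expanding() adds something new here.
running_median = s.expanding().median()

# A rolling window has a fixed size; the first result is NaN because
# a window of 2 is not yet full at the first element.
rolling_mean = s.rolling(window=2).mean()

# Several aggregations at once give one column per function.
summary = s.expanding().agg(["sum", "count"])
```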

Dr-Irv · 2016-12-14T23:36:19Z

@sinhrks I deleted the .assign example. Was trying to copy something from R, but the example in "Make New Variables" is better. I added .reset_index() in its place, thinking that is used more often than .reindex. As for .rename, it's the method to use when doing method chaining. I wish there was a way to change all column names when using method chaining.
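On renaming inside a method chain, here is a small sketch with a hypothetical frame. `.rename` accepts either a mapping or a function applied to every label. The `set_axis` call is an assumption about the reader's environment: it returns a new object by default only in pandas 1.0 or later, which postdates this discussion.

```python
import pandas as pd

# Hypothetical frame used only for illustration.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Rename selected columns with a mapping, then keep chaining.
renamed = (
    df.rename(columns={"a": "width", "b": "length"})
      .assign(area=lambda d: d["width"] * d["length"])
)

# Apply one function to every column label.
upper = df.rename(columns=str.upper)

# Replace all column labels at once (assumes pandas 1.0 or later).
relabeled = df.set_axis(["width", "length"], axis=1)
```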

I'm space constrained to add .dropna to "Subset Variables" and to move .drop to "Subset Variables". One reason to keep .drop in "Make New Variables" is because after .assign, you often want to drop the original variables.

Let me get the full first draft done, and then we can discuss further enhancements.

Thank you all for the comments and feedback!

Dr-Irv · 2016-12-15T21:13:39Z

Here is a proposed first draft with the 2 pages that correspond pretty well to what was in the R cheat sheet. Based on suggestions above, I added some content that doesn't appear in the R version. Comments and criticism are welcome.

Because of space constraints, if you have ideas for things to add, please also suggest something to delete.

@jreback I have limited availability between 12/16 and 12/20, so I can take comments into account on 12/21, and then hopefully be done with it. I would like suggestions of where to put this in the source (i.e., which directory). Should the Powerpoint source and PDF both be in there (so others could modify the Powerpoint and create the PDF)? Also, would I then just submit a pull request with those 2 new documents (which would cause the tests to be run, which seems a bit silly since I'm just adding 2 new files that are not touched by the test scripts)?
Pandas Cheat Sheet.pdf

jreback · 2016-12-15T21:44:22Z

@Dr-Irv looks really good! is .resample anywhere?

yes do a PR with the powerpoint & pdf. as well as a small readme (or script) on how to build the pdf.

I would put in pandas\docs\cheatsheet\

Dr-Irv · 2016-12-16T15:37:42Z

@jreback Here are my challenges with including .resample. One is space. The second is that it is specific to datetime-like indexes. Nowhere else in the cheat sheet do I have info about how pandas manages dates and times, so without that context, putting in .resample becomes difficult to explain. I think it would also need downsampling and upsampling examples. So space becomes a bigger issue.
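For context on what such examples might look like, here is a minimal sketch of one downsampling and one upsampling call. The dates and values are invented, and both series need a datetime-like index, which is the prerequisite described above.

```python
import pandas as pd

# Fourteen daily observations starting on Monday 2016-12-05 (invented data).
idx = pd.date_range("2016-12-05", periods=14, freq="D")
daily = pd.Series(range(14), index=idx)

# Downsampling: aggregate daily values into weekly bins ending on Sunday.
weekly = daily.resample("W").sum()

# Upsampling: two observations two days apart, filled forward to daily frequency.
coarse = pd.Series(
    [1.0, 2.0],
    index=pd.to_datetime(["2016-12-05", "2016-12-07"]),
)
filled = coarse.resample("D").ffill()
```

Downsampling needs an aggregation such as `sum()`, while upsampling needs a fill rule such as `ffill()`, which is part of why the two cases take more room to explain than a single line.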

With respect to conversion, I have Powerpoint on Windows, and it has an export PDF feature. Scripting would be operating system dependent (and probably not worth the time to figure out). So I will just document how to do it in a text file.

One other question - once published, where would links to the cheat sheet exist? On the pandas.pydata.org web site? Or from within the documentation itself? If the latter, how do we refer to something outside the documentation tree?

jreback · 2016-12-16T15:40:23Z

a README is fine.

you can refer to links like: https://github.com/pandas-dev/pandas/blob/master/doc/README.rst (for example), which is a static link to whatever is there.

we can add a link on the website as well.

on .resample() it would be nice to mention as this is one of pandas big strengths (esp compared to R), but I get the space issue.

is adding a 3rd page too much?

Dr-Irv · 2016-12-16T15:43:52Z

@jreback Adding a 3rd page is a possibility. In addition to discussing datetime-like issues, as well as .resample, I could add more on MultiIndex, and I was also thinking of adding information about input and output for common formats.

If I were to do that, should I work on that before submitting the PR? Or do the PR now for the current version, so it's out there, and then do a new PR once I finish a third page?

jorisvandenbossche · 2016-12-16T16:26:04Z

@Dr-Irv Really cool work! Thanks a lot for this!

I have some comments, but will save them for later. Just a quick one: I think the zdf dataframe on the second page (bottom right corner, "Set like operations") is not correct. Shouldn't it also have the x2 column instead of x3? (otherwise the example results are not correct).

Regarding how to include this into pandas: another possibility is that you keep it in a pandas-cheatsheet repo of your own instead of putting it into the pandas code base itself. But then of course include the same clear links to it in the pandas documentation as in the other case.
Two reasons I propose this: first, you keep more of the credit and ownership of it. Second, what exactly is included will always be a bit subjective / based on personal taste. And I would like to avoid such discussions (although useful to see feelings about certain functionality, e.g. I rather use rename instead of assigning columns directly) on the main pandas issue tracker (it's already crowded enough :-)). In your repo you just make the final decision.
But if you would like to see it included in the pandas codebase (and not only link to it), that's fine for me as well.

jorisvandenbossche · 2016-12-16T16:28:22Z

@Dr-Irv BTW, would it be possible to already post a pptx version as well? I am giving a 3-day pandas course beginning next week, and was thinking of giving this to the students. With the pptx version I can make small adaptations (like the incorrect frame) myself, as the new version that you can make on 12/21 will be too late for the course.

jreback · 2016-12-16T16:36:30Z

@Dr-Irv btw, no problem with including a mention of the author as well!

Dr-Irv · 2016-12-21T16:26:56Z

@jorisvandenbossche I made an error in the definition of the frame 'zdf' in the Cheat Sheet. That is now fixed, so the examples are correct. I will include the PPTX in the Pull Request. I'd prefer it be part of the pandas main project, and then others can contribute. I have revealed my secret identity at the bottom. :->

Sorry that I couldn't get things to you prior to today as I was on vacation.

Pandas Cheat Sheet.pdf

jreback · 2016-12-21T17:06:43Z

looks good @Dr-Irv

if you want to put this in a PR, add a note to the 0.19.2 whatsnew (a link to the location would be great).

jreback added Docs Difficulty Intermediate labels May 17, 2016

jreback added this to the Next Major Release milestone May 17, 2016

Dr-Irv mentioned this issue Dec 21, 2016

DOC: Pandas Cheat Sheet (GH13202) #14943

Closed

jreback modified the milestones: 0.19.2, Next Major Release Dec 21, 2016

jreback closed this as completed in f79bc7a Dec 21, 2016

jorisvandenbossche pushed a commit that referenced this issue Dec 24, 2016

DOC: Pandas Cheat Sheet

9bf1ace

closes #13202 closes #14943 (cherry picked from commit f79bc7a)

ShaharBental pushed a commit to ShaharBental/pandas that referenced this issue Dec 26, 2016

DOC: Pandas Cheat Sheet

21fe538

closes pandas-dev#13202 closes pandas-dev#14943
