DOC: pandas cheat sheet #13202

Closed
jreback opened this issue May 17, 2016 · 25 comments

@jreback
Contributor

jreback commented May 17, 2016

https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf

I think we could pretty much rip this off exactly as is and just substitute pandas functions directly.

Further, we could update the comparison with R a bit.

anyone up for this?

@jreback jreback added this to the Next Major Release milestone May 17, 2016
@jorisvandenbossche
Member

+1 I have wanted to do this already for a long time! (but not sure I will be able to :-))

@sinhrks
Member

sinhrks commented May 17, 2016

+1. shall we close #1618?

@MrMauricioLeite

+1

@Dr-Irv
Contributor

Dr-Irv commented Dec 2, 2016

I don't know R, but know that having a cheatsheet like this would be helpful. I'm going to give it a shot in Powerpoint, because I know it well and can make it pretty. I'm not sure what other tool to use that would let other people edit it and provide the good formatting.

@jreback
Contributor Author

jreback commented Dec 2, 2016

@Dr-Irv we will just check in the pdf and whatever format it's in

@Dr-Irv
Contributor

Dr-Irv commented Dec 2, 2016

Here is what I did today. Plagiarizing is blatant! Will be flying a lot next week, so this will be good to do on the plane.
Pandas Cheat Sheet.pdf

@TomAugspurger
Contributor

Nice start! That original from RStudio is CC 4.0, so AFAIK it's fine to copy / transform it as long as you acknowledge RStudio for the original.

@Dr-Irv
Contributor

Dr-Irv commented Dec 9, 2016

Here is the first page of the cheat sheet. I'll be working on page 2 next. In most places, I found the pandas way of doing things. I used @jreback's initial comment as a guideline, i.e., "I think we could pretty much rip this off exactly as is and just substitute pandas functions directly." There were places where the R example didn't make sense, so I made my own arbitrary choices.

Comments and criticism are welcome. I think it is better to get feedback here rather than anywhere else.
Pandas Cheat Sheet.pdf

@jreback
Contributor Author

jreback commented Dec 10, 2016

for the subset section where you have:

df[['width','length','species']]
  Select multiple columns with specific names.
df['width'] or df.width
  Select single column with specific name.
df[[i for i in df.columns if '.' in i]]
  Select columns whose name contains a character string.
df[[i for i in df.columns if i.endswith("Length")]]
  Select columns whose name ends with a character string.
df[[i for i in df.columns if re.match('.t.',i)]]
  Select columns whose name matches a regular expression.
df[["x"+str(i) for i in range(1,6)]]
  Select columns named x1, x2, x3, x4, x5.
df[[i for i in df.columns if i.startswith("Length")]]
  Select columns whose name starts with a character string.
df.loc[:,"x2":"x4"]
  Select all columns between x2 and x4 (inclusive).
df[[i for i in df.columns if i != "Species"]]
  Select all columns except Species.
df.iloc[:,[1,2,5]]
  Select columns in positions 1, 2 and 5.

more idiomatic to do:

df[['width','length','species']]
  Select multiple columns with specific names.
df['width'] or df.width
  Select single column with specific name.
df.filter(regex='\.')
  Select columns whose name contains a character string.
df.filter(regex='Length$')
  Select columns whose name ends with a character string.
df.filter(regex='.t.')
  Select columns whose name matches a regular expression.
df.filter(regex='^x[1-5]$')
  Select columns named x1, x2, x3, x4, x5.
df.filter(regex='^Length')
  Select columns whose name starts with a character string.
df.loc[:,"x2":"x4"]
  Select all columns between x2 and x4 (inclusive).
df.filter(regex="^(?!Species$).*")
  Select all columns except Species.
df.iloc[:,[1,2,5]]
  Select columns in positions 1, 2 and 5.

though maybe eliminate some of these column selections and show usage of .loc instead
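
A minimal sketch of the `.filter`-based selections discussed in this comment. The frame and its column names are hypothetical and chosen only for illustration. `DataFrame.filter(regex=...)` matches with `re.search`, so a pattern needs explicit `^` or `$` anchors when it should only match at the start or end of a name.

```python
import pandas as pd

# Hypothetical one-row frame; the column names are illustrative only.
df = pd.DataFrame(
    [[1, 2, 3, 4, 5, 6]],
    columns=["Sepal.Length", "LengthTotal", "x1", "x2", "x3", "Species"],
)

ends_with = df.filter(regex=r"Length$")            # names ending in "Length"
starts_with = df.filter(regex=r"^Length")          # names starting with "Length"
has_dot = df.filter(regex=r"\.")                   # names containing a literal "."
not_species = df.filter(regex=r"^(?!Species$).*")  # every column except "Species"
x_range = df.loc[:, "x1":"x3"]                     # label slice, inclusive on both ends
```

Label slicing with `.loc` includes both endpoints, unlike positional slicing with `.iloc`.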

@Dr-Irv
Contributor

Dr-Irv commented Dec 13, 2016

@jreback Thanks for the suggestions. I've made changes in my current working copy.

Separate question - R has a cume_dist() method that computes the cumulative distribution of a vector. Similar to rank(pct=True), but different. Is there a pandas equivalent?
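
One possible answer, not confirmed anywhere in this thread: assuming `cume_dist()` is defined as the rank with ties taking the highest rank, divided by the number of non-missing values, `rank(method="max", pct=True)` should give the same numbers. The sketch below contrasts it with the default `method="average"` on a made-up series.

```python
import pandas as pd

s = pd.Series([1, 2, 2, 3])

# Fraction of values <= each element: tied values take the highest rank.
cume_dist = s.rank(method="max", pct=True)

# The default (method="average") averages the ranks of tied values instead.
avg_pct = s.rank(pct=True)
```

The two results differ only where there are ties, which is consistent with "similar to rank(pct=True), but different".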

@Dr-Irv
Contributor

Dr-Irv commented Dec 13, 2016

Here's my latest version. Only thing left to do is to do groupby() examples in the open space on the second page. Some notes: (1) Don't have correspondence to R's cume_dist(), so left it out, and added clip() instead. (2) Setdiff between DataFrames is not there (discussion in #4480) so the example at bottom right of page 2 isn't so pretty. Maybe someone knows a better way.
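
A sketch of one way to emulate a set difference between two DataFrames, in the absence of a dedicated method: merge with `indicator=True` and keep the rows found only in the left frame. The frames `ydf` and `zdf` and their contents are assumptions for illustration, not copied from the PDF.

```python
import pandas as pd

# Hypothetical frames sharing the same columns.
ydf = pd.DataFrame({"x1": ["A", "B", "C"], "x2": [1, 2, 3]})
zdf = pd.DataFrame({"x1": ["B", "C", "D"], "x2": [2, 3, 4]})

# Rows of ydf that do not appear in zdf.
# With no `on=` argument, merge joins on all shared columns (x1 and x2).
setdiff = (
    pd.merge(ydf, zdf, how="left", indicator=True)
    .loc[lambda d: d["_merge"] == "left_only"]  # keep rows with no match in zdf
    .drop(columns=["_merge"])                   # remove the helper column
)
```

Because the join uses every shared column, a row counts as "in both" only when all of its values match.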
Pandas Cheat Sheet.pdf

@jreback
Contributor Author

jreback commented Dec 13, 2016

@Dr-Irv looking good!

I would remove use of .between (this is going to be removed, and is trivially replaced by .loc)

I would add use of .rolling() & .resample() (you said adding groupby). the .expanding().apply(any) example is not very typical, would show something like .expanding().sum() (or .agg(['sum', 'count']), or whatever).

maybe use .plot somewhere?

of course it's ONLY 2 pages! hahah
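
For reference, a small sketch of the window operations mentioned in this comment, on a made-up series. It shows an expanding window with several aggregations at once and a fixed-size rolling window.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0])

# Expanding window: each row aggregates everything seen so far.
# Passing a list returns a DataFrame with one column per aggregation.
running = s.expanding().agg(["sum", "count"])

# Rolling window of size 2: the first row has an incomplete window, so it is NaN.
windowed = s.rolling(window=2).sum()
```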

@sinhrks
Member

sinhrks commented Dec 14, 2016

Looks really good. A few points:

  • Reshaping Data
    • The .assign example looks incorrect. It duplicates the Make New Variables section.
    • Personally, I feel .rename is used less often than setting columns directly.
    • Add .reindex?
  • Subset Rows
    • Add .dropna?
  • Make New Variable
    • .drop should be moved to Subset Variables?

@Dr-Irv
Contributor

Dr-Irv commented Dec 14, 2016

@jreback regarding your last comments. I've removed between(). I had it in there because it was in the R cheat sheet. I've taken out the two expanding().apply() examples that computed cumulative all and cumulative any. Those examples were in there because they were in the R version. In terms of adding expanding() examples, expanding().sum() and cumsum() are the same thing, so I added expanding().median() instead.

My plan was to use remaining space on second page for groupby(), and after that is done, I'll look for space for .rolling() and .resample(). As for .plot(), maybe that's a third page........
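
The equivalence mentioned above can be checked with a short sketch on a made-up float series, alongside the `expanding().median()` alternative:

```python
import pandas as pd

s = pd.Series([3.0, 1.0, 2.0])

# expanding().sum() and cumsum() produce the same running totals.
same = s.expanding().sum().equals(s.cumsum())

# expanding().median() has no cum* shortcut, so it shows something new.
running_median = s.expanding().median()
```

A float series is used here because `expanding().sum()` returns floats, while `cumsum()` keeps an integer dtype for integer input.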

@Dr-Irv
Contributor

Dr-Irv commented Dec 14, 2016

@sinhrks I deleted the .assign example. Was trying to copy something from R, but the example in "Make New Variables" is better. I added .reset_index() in its place, thinking that is used more often than .reindex. As for .rename, it's the method to use when doing method chaining. I wish there was a way to change all column names when using method chaining.

I'm space constrained to add .dropna to "Subset Variables" and to move .drop to "Subset Variables". One reason to keep .drop in "Make New Variables" is because after .assign, you often want to drop the original variables.
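
A sketch of the method-chaining pattern described in this comment, using a hypothetical frame: `.assign` to derive a column, `.drop` to remove the inputs, and `.rename` to relabel inside the chain. As an aside that is not from this thread, recent pandas versions also offer `set_axis(labels, axis=1)`, which returns a new frame with all column labels replaced and so can be chained as well.

```python
import pandas as pd

# Hypothetical frame; column names are illustrative only.
df = pd.DataFrame({"length": [2.0, 3.0], "width": [4.0, 5.0]})

result = (
    df.assign(area=lambda d: d["length"] * d["width"])  # add a derived column
    .drop(columns=["length", "width"])                  # drop the original inputs
    .rename(columns={"area": "Area"})                   # rename within the chain
    .reset_index(drop=True)                             # fresh 0..n-1 index
)

# Replace every column label in one chained step (recent pandas versions).
relabeled = df.set_axis(["l", "w"], axis=1)
```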

Let me get the full first draft done, and we discuss further enhancements.

Thank you all for the comments and feedback!

@Dr-Irv
Contributor

Dr-Irv commented Dec 15, 2016

Here is a proposed first draft with the 2 pages that correspond pretty well to what was in the R cheat sheet. Based on suggestions above, I added some content that doesn't appear in the R version. Comments and criticism are welcome.

If you have ideas of things to add, due to space constraints, please suggest something to be deleted.

@jreback I have limited availability between 12/16 and 12/20, so I can take comments into account on 12/21, and then hopefully be done with it. I would like suggestions of where to put this in the source (i.e., which directory). Should the Powerpoint source and PDF both be in there (so others could modify the Powerpoint and create the PDF)? Also, would I then just submit a pull request with those 2 new documents (which would cause the tests to be run, which seems a bit silly since I'm just adding 2 new files that are not touched by the test scripts)?
Pandas Cheat Sheet.pdf

@jreback
Contributor Author

jreback commented Dec 15, 2016

@Dr-Irv looks really good! is .resample anywhere?

yes do a PR with the powerpoint & pdf. as well as a small readme (or script) on how to build the pdf.

I would put in pandas\docs\cheatsheet\

@Dr-Irv
Contributor

Dr-Irv commented Dec 16, 2016

@jreback Here are my challenges with including .resample. One is space. The second is that it is specific to datetime-like indexes. Nowhere else in the cheat sheet do I have info about how pandas manages dates and times, so without that context, putting in .resample becomes difficult to explain. I think it would also need downsampling and upsampling examples. So space becomes a bigger issue.
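
To illustrate the two cases named above, here is a minimal sketch of downsampling and upsampling with `.resample` on a made-up daily series. The dates and values are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical daily series with a DatetimeIndex, which .resample requires.
daily = pd.Series(
    [1, 2, 3, 4],
    index=pd.date_range("2016-01-01", periods=4, freq="D"),
)

# Downsampling: aggregate daily values into 2-day bins.
down = daily.resample("2D").sum()

# Upsampling: go from 2-day spacing back to daily, forward-filling the gaps.
up = down.resample("D").ffill()
```

Downsampling needs an aggregation (here `sum`), while upsampling needs a fill rule (here `ffill`), which is part of why the topic takes more room than a single line.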

With respect to conversion, I have Powerpoint on Windows, and it has an export PDF feature. Scripting would be operating system dependent (and probably not worth the time to figure out). So I will just document how to do it in a text file.

One other question - once published, where would links to the cheat sheet exist? On the pandas.pydata.org web site? Or from within the documentation itself? If the latter, how do we refer to something outside the documentation tree?

@jreback
Contributor Author

jreback commented Dec 16, 2016

a README is fine.

you can refer to links like: https://github.com/pandas-dev/pandas/blob/master/doc/README.rst (for example), which is a static link to whatever is there.

we can add a link on the website as well.

on .resample() it would be nice to mention as this is one of pandas big strengths (esp compared to R), but I get the space issue.

is adding a 3rd page too much?

@Dr-Irv
Contributor

Dr-Irv commented Dec 16, 2016

@jreback Adding a 3rd page is a possibility. In addition to discussing datetime-like issues, as well as .resample, I could add more on MultiIndex, and I was also thinking of adding information about input and output for common formats.

If I were to do that, should I work on that before submitting the PR? Or do the PR now for the current version, so it's out there, and then do a new PR once I finish a third page?

@jorisvandenbossche
Member

@Dr-Irv Really cool work! Thanks a lot for this!

I have some comments, but will save them for later. Just a quick one: I think the zdf dataframe on the second page (bottom right corner, "Set like operations") is not correct. Shouldn't it also have the x2 column instead of x3? (Otherwise the example results are not correct.)

Regarding how to include this into pandas: another possibility is that you keep it in a pandas-cheatsheet repo of your own instead of putting it into the pandas code base itself. But then of course include the same clear links to it in the pandas documentation as in the other case.
Two reasons I propose this: first, you keep more of the credit and ownership of it. Second, what exactly is included will always be a bit subjective / based on personal taste. And I would like to avoid such discussion (although useful to see feelings about certain functionality, e.g. I would rather use rename instead of assigning columns directly) on the main pandas issue tracker (it's already crowded enough :-)). In your repo you just make the final decision.
But if you would like to see it included in the pandas codebase (and not only link to it), that's fine for me as well.

@jorisvandenbossche
Member

@Dr-Irv BTW, would it be possible to already post a pptx version as well? I am giving a 3-day pandas course beginning next week, and was thinking of giving this to the students. With the pptx version I can make small adaptations (like the incorrect frame) myself, as the new version that you can make on 12/21 will be too late for the course.

@jreback
Contributor Author

jreback commented Dec 16, 2016

@Dr-Irv btw, no problem with including a mention of the author as well!

@Dr-Irv
Contributor

Dr-Irv commented Dec 21, 2016

@jorisvandenbossche I made an error in the definition of the frame 'zdf' in the Cheat Sheet. That is now fixed, so the examples are correct. I will include the PPTX in the Pull Request. I'd prefer it be part of the pandas main project, and then others can contribute. I have revealed my secret identity at the bottom. :->

Sorry that I couldn't get things to you prior to today as I was on vacation.

Pandas Cheat Sheet.pdf

@jreback
Contributor Author

jreback commented Dec 21, 2016

looks good @Dr-Irv

if you want to put this in a PR, add to the 0.19.2 whatsnew (with a link to the location would be great).

@jreback jreback modified the milestones: 0.19.2, Next Major Release Dec 21, 2016
jorisvandenbossche pushed a commit that referenced this issue Dec 24, 2016
closes #13202
closes #14943

(cherry picked from commit f79bc7a)
ShaharBental pushed a commit to ShaharBental/pandas that referenced this issue Dec 26, 2016