API/ENH: Add mutate like method to DataFrames #9229

Closed · TomAugspurger opened this issue Jan 12, 2015 · 23 comments · Fixed by #9239

@TomAugspurger
Contributor

In my notebook comparing dplyr and pandas, I gained a new level of appreciation for the ability to chain strings of operations together. In my own code, the biggest impediment to this is adding additional columns that are calculations on existing columns. For example

# R / dplyr
mutate(flights,
   gain = arr_delay - dep_delay,
   speed = distance / air_time * 60)

# ... calculation involving these

vs.

flights['gain'] = flights.arr_delay - flights.dep_delay
flights['speed'] = flights.distance / flights.air_time * 60

# ... calculation involving these later

just doesn't flow as nicely, especially if this mutate is in the middle of a chain.

I'd propose a new method (perhaps stealing mutate) that's similar to dplyr's.
The function signature could be kwarg only, where the keywords are the new column names. e.g.

flights.mutate(gain=flights.arr_delay - flights.dep_delay)

This would return a DataFrame with the new column gain in addition to the original columns.
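
For concreteness, a minimal sketch of the proposed behaviour as a standalone helper (the function name and implementation here are hypothetical, purely to illustrate returning a new frame instead of mutating in place):

import pandas as pd

def mutate_copy(df, **kwargs):
    # Hypothetical sketch: add one column per keyword to a *copy* of df,
    # leaving the original DataFrame untouched, and return the copy.
    out = df.copy()
    for name, values in kwargs.items():
        out[name] = values
    return out

flights = pd.DataFrame({'arr_delay': [10, 5], 'dep_delay': [3, 8],
                        'distance': [1400, 1600], 'air_time': [200, 230]})
result = mutate_copy(flights,
                     gain=flights.arr_delay - flights.dep_delay,
                     speed=flights.distance / flights.air_time * 60)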

Worked out example

import pandas as pd
import seaborn as sns

iris = sns.load_dataset('iris')

(iris.query('sepal_length > 4.5')
     .mutate(ratio=iris.sepal_length / iris.sepal_width)  # new part
     .groupby(pd.cut(iris.ratio)).mean()
)

Thoughts?

@TomAugspurger TomAugspurger added this to the 0.16.0 milestone Jan 12, 2015
@jreback
Contributor

jreback commented Jan 12, 2015

If this syntax were supported (and it's not that difficult), IOW a multi-assignment, then .mutate would just be sugar.

In [1]: df = DataFrame(np.random.randn(5,3),columns=list('abc'))

In [2]: df
Out[2]: 
          a         b         c
0 -1.818219 -0.550046  0.705300
1  0.865112  0.130718 -0.117105
2 -0.914553  0.575054 -0.881258
3 -1.417902  1.925067 -2.121546
4  0.604826 -0.461150  0.994271

In [3]: df[['d','e']] = df['a'] + 1, df['b'] + 1
KeyError: "['d' 'e'] not in index"
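
For reference, the same multi-column addition can already be spelled with the existing API, e.g. by building the new columns separately and concatenating along axis=1; a small sketch of what the sugar would wrap:

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 3), columns=list('abc'))

# Build the new columns separately, then glue them on along the column axis.
new_cols = pd.DataFrame({'d': df['a'] + 1, 'e': df['b'] + 1})
df = pd.concat([df, new_cols], axis=1)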

@jreback
Contributor

jreback commented Jan 12, 2015

@TomAugspurger can you add a mini example (of split-apply-combine), showing how current (and potentially new) syntax would work?

@TomAugspurger
Contributor Author

Added an example to the end of the original post.

@TomAugspurger
Contributor Author

Allowing multiple new columns to be added shouldn't be too hard. dplyr does allow calculations to refer to new columns within the same mutate, e.g.

mutate(flights,
   gain = arr_delay - dep_delay,
   gain_per_hour = gain / (air_time / 60)
)

The way I'm seeing the function signature right now is df.mutate(**kwargs), which may make that example a bit tricky since dicts aren't ordered. We might be able to work around that, but for now it could be split into two mutates.

@shoyer
Member

shoyer commented Jan 12, 2015

I really like this idea. mutate is much better than dataframes with actual mutable state (as encouraged by Python's syntax).

@TomAugspurger I'm not sure what to make of this example:

(iris.query('sepal_length > 4.5')
     .mutate(ratio=iris.sepal_length / iris.sepal_width)  # new part
     .groupby(pd.cut(iris.ratio)).mean()
)

In particular, are you suggesting that iris.sepal_length on the second line should somehow refer to the queried version of the iris dataset?

Note that @mrocklin added similar dplyr like syntax to blaze: blaze/blaze#484

This is definitely a place where Python's syntax (and limited magic, which is usually a good thing) makes things a little trickier than in R.

@mrocklin
Contributor

FWIW, the semantics of plyr's transform, mutate's predecessor, might be a better fit.

Transform does not allow one new column to depend on another new column defined in the same call. It also doesn't have the word mutate in it, which is good if you want the operation to return a new DataFrame.

@jorisvandenbossche
Member

I had the same question as @shoyer:
In your example, iris in mutate refers to the original iris, so the result of iris.sepal_length / iris.sepal_width cannot be added just like that to the frame. But this is also what iris['ratio'] = iris.sepal_length / iris.sepal_width would do -> automatic label alignment, and only adding the values of the right hand side at the labels of the left-hand side.
But if you do this after a query (which is what the chaining allows), you will of course calculate more than necessary.

Further, I was also thinking this looks very much like append (if you would have an axis=1 argument; right now append is just for rows). With a small hack to have it add columns:

In [41]: (iris.query('sepal_length > 4.5')
    ...:      .T.append(pd.Series(iris.sepal_length / iris.sepal_width, name='ratio')).T
    ...:      .head())
    ...: 
Out[41]: 
  sepal_length sepal_width petal_length petal_width species     ratio
0          5.1         3.5          1.4         0.2  setosa  1.457143
1          4.9           3          1.4         0.2  setosa  1.633333
2          4.7         3.2          1.3         0.2  setosa   1.46875
3          4.6         3.1          1.5         0.2  setosa  1.483871
4            5         3.6          1.4         0.2  setosa  1.388889

There are some things to work out, but I also certainly think such a feature would be a nice addition! I am not fully sure about the mutate name (but maybe that is because I am not a native speaker).
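
For comparison, a similar result could be had without the double transpose (which also upcasts the numeric columns to object) by left-joining the computed Series onto the queried frame; a sketch using only existing API:

import seaborn as sns

iris = sns.load_dataset('iris')
queried = iris.query('sepal_length > 4.5')
ratio = (iris.sepal_length / iris.sepal_width).rename('ratio')
result = queried.join(ratio)  # left join on the index keeps only the queried rows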

@TomAugspurger
Contributor Author

Sorry about the tricky example, but I guess it's a good one since it exposes a difficulty.

In my head I was thinking that it should refer to just the ones that meet the query of > 4.5, though obviously that's not how I wrote it. Maybe we have to go down the path of evaling strings and have

iris.query('sepal_length > 4.5').mutate('ratio=sepal_length / sepal_width')

but I was wanting to avoid that.

@mrocklin
Contributor

The following is something like what Blaze does. It's less dplyr-ish and not as well chained, but there are things that one just can't do in Python without macros.

df = iris[iris.sepal_length > 4.5]
df.transform(sepal_ratio = df.sepal_length / df.sepal_width, 
             petal_ratio = df.petal_length / df.petal_width)

@TomAugspurger
Contributor Author

A bit of a summary,

  • For now I think we should ignore the query bit of my original example.
  • Joris' point about this being similar to append is good. Should we just add axis and name arguments to append (I don't think so)?
  • transform seems to be favored over mutate. We do already have a transform on GroupBy objects, and GroupBy.transform's defining feature is that the returned DataFrame will be like-indexed. Whatever we have here will not be like-indexed; is that potentially confusing? I'd be OK with transform.

I think @mrocklin's example in his last post strikes a good balance. I'll put together a PR.

@shoyer
Member

shoyer commented Jan 13, 2015

I was just talking about this with my co-worker and came up with another idea to try to replace R's macros.

What about automatically injecting local variables when the argument is a function with no arguments? Basically, we can have macros as long as we write lambda: first.

For example: iris.mutate(ratio = lambda: sepal_length / sepal_width)

With some black magic, we could even extend this to other methods to make something very dplyr-like:

(iris[lambda: sepal_length > 4.5]
     .mutate(ratio = lambda: sepal_length / sepal_width)
     .groupby(lambda: pd.cut(ratio))
     .apply(lambda: ratio - ratio.mean()))

I'm not entirely sure this is a good idea! But it does make me less jealous of R users :).
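
As a rough illustration of how that injection could work (hypothetical helper, nothing like this exists in pandas): rebuild the zero-argument lambda with the frame's columns spliced into its global namespace before calling it.

import types
import pandas as pd

def eval_with_columns(df, func):
    # Re-create the zero-argument function with the DataFrame's columns
    # injected into its globals; name lookup inside the lambda is deferred
    # until call time, so sepal_length etc. resolve to the columns.
    namespace = dict(func.__globals__)
    namespace.update({col: df[col] for col in df.columns})
    return types.FunctionType(func.__code__, namespace)()

iris = pd.DataFrame({'sepal_length': [5.1, 4.9], 'sepal_width': [3.5, 3.0]})
ratio = eval_with_columns(iris, lambda: sepal_length / sepal_width)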

Personally, I don't find the groupby transform method very useful (see #9235) and wouldn't mind deprecating it -- I always use apply. That said, new names are less confusing, especially since we might even want to allow mutate to act on groupby objects.

@shoyer
Member

shoyer commented Jan 13, 2015

@jhorowitz-coursera wrote the very similar ply_select in pandas-ply. He uses a nice trick with X to automatically generate functions.
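
For anyone curious, the X trick boils down to overloading attribute access and operators so that an expression builds up a function of the frame instead of evaluating immediately; a minimal illustration (not pandas-ply's actual implementation, and only division is wired up here):

class Expr:
    # Each operation returns a new Expr wrapping "how to compute me from a
    # DataFrame", so X.sepal_length / X.sepal_width is a callable of the frame.
    def __init__(self, fn=lambda df: df):
        self._fn = fn

    def __getattr__(self, name):
        return Expr(lambda df, f=self._fn: getattr(f(df), name))

    def __truediv__(self, other):
        rhs = other._fn if isinstance(other, Expr) else (lambda df: other)
        return Expr(lambda df, f=self._fn: f(df) / rhs(df))

    def __call__(self, df):
        return self._fn(df)

X = Expr()
ratio = X.sepal_length / X.sepal_width  # a function: DataFrame -> Series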

@TomAugspurger
Contributor Author

@shoyer that's kinda-awesome / evil.

mailing list discussion about pandas-ply.

@TomAugspurger
Contributor Author

I'm settling on a relatively simple implementation.

signature: DataFrame.transform(**kwargs)

  • the keyword is the name of the new column (existing columns are overwritten if there's a name conflict, as in dplyr)
  • the value is either
    • called on self if it's callable; the callable should be a function of one argument, the DataFrame it's being called on
    • inserted as-is otherwise

In [7]: df.head()
Out[7]: 
   sepal_length  sepal_width  petal_length  petal_width species
0           5.1          3.5           1.4          0.2  setosa
1           4.9          3.0           1.4          0.2  setosa
2           4.7          3.2           1.3          0.2  setosa
3           4.6          3.1           1.5          0.2  setosa
4           5.0          3.6           1.4          0.2  setosa

In [8]: (df.query('species == "virginica"')
           .transform(sepal_ratio=lambda x: x.sepal_length / x.sepal_width)
           .head())
Out[8]: 
     sepal_length  sepal_width  petal_length  petal_width    species  \
100           6.3          3.3           6.0          2.5  virginica   
101           5.8          2.7           5.1          1.9  virginica   
102           7.1          3.0           5.9          2.1  virginica   
103           6.3          2.9           5.6          1.8  virginica   
104           6.5          3.0           5.8          2.2  virginica   

     sepal_ratio  
100     1.909091  
101     2.148148  
102     2.366667  
103     2.172414  
104     2.166667  

This way we can handle the query case, where you don't have a reference to the DataFrame being passed in, and it's not too magical (and it's easier to implement).
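
A minimal standalone sketch of those semantics (hypothetical free function; the actual method landed via the PR referenced below):

import pandas as pd

def transform_like(df, **kwargs):
    # Sketch of the proposed behaviour: return a copy with one new (or
    # overwritten) column per keyword; callables are called with the
    # DataFrame itself, anything else is inserted as-is.
    out = df.copy()
    for name, value in kwargs.items():
        out[name] = value(out) if callable(value) else value
    return out

df = pd.DataFrame({'sepal_length': [6.3, 5.8], 'sepal_width': [3.3, 2.7]})
result = transform_like(df, sepal_ratio=lambda x: x.sepal_length / x.sepal_width)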

@TomAugspurger
Contributor Author

Submitted a PR if you want to move the discussion there.

@shoyer
Member

shoyer commented Jan 14, 2015

Just FYI, I think I'm going to put my "thunk"-based proposal (with argument-less lambdas) into a separate package -- I already have a working proof of concept.

The most annoying thing is ensuring that new DataFrames remain macro-friendly... that requires writing a lot of wrappers (probably unavoidable).

Also, I haven't been able to come up with a way to write hygienic macros. Probably not possible in Python.

@dalejung
Contributor

Do you think a heavy-handed approach to DSLs would gain traction? I've been experimenting with taking over evaluation completely via import hooks and IPython input transformers.

https://github.com/dalejung/naginpy

It's kind of all over the map, as I have multiple use cases in mind and I'm still figuring out how to support all of them cleanly.

I had tried to get what I wanted via just AST transforms and fun stuff like context managers that introduce temporary scopes, but they were always lacking the syntax I wanted.

@shoyer
Member

shoyer commented Jan 14, 2015

@dalejung What does it look like to enable your DSLs in a script? My gut is that full DSLs are probably too painful for me to integrate into my workflows.

@dalejung
Contributor

@shoyer Depends, I suppose. The only hard requirement is that the import machinery is installed. I generally run scripts via %run, which works with my IPython profile. I don't think a straight python script.py will ever work; it would have to be python -m naginpy script.py.

Haven't decided how I want to signal modules on/off. I use a sentinel value for datamodule to tell it whether to use the custom loader. Might do the same with naginpy.

@datnamer

datnamer commented Feb 3, 2015

@TomAugspurger @jreback @shoyer Have you guys seen this library? It has an interesting take on the situation: https://github.com/coursera/pandas-ply

@shoyer
Member

shoyer commented Feb 3, 2015

@datnamer Yes, we have, see my comment above :).

@datnamer

datnamer commented Feb 3, 2015

@shoyer oops missed that, sorry for the noise

TomAugspurger pushed a commit to TomAugspurger/pandas that referenced this issue Mar 1, 2015
Creates a new method for DataFrame, based off dplyr's mutate.
Closes pandas-dev#9229