API/ENH: Add mutate like method to DataFrames #9229
Comments
If this syntax were supported (and not that difficult), IOW, a multi-assignment.
|
@TomAugspurger can you add a mini example (of split-apply-combine), showing how current (and potentially new) syntax would work? |
Added an example to the end of the original post. |
Allowing multiple new columns to be added shouldn't be too hard. dplyr does allow calculations to refer to new columns within the same mutate, e.g.
mutate(flights,
       gain = arr_delay - dep_delay,
       gain_per_hour = gain / (air_time / 60)
)
The way I'm seeing the function signature right now is |
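For comparison, here is a rough Python equivalent of the dplyr call above, written with `DataFrame.assign` as it exists in current pandas (not the API under discussion at the time of this comment). The `flights` values are made up for illustration. In recent pandas versions, later keyword arguments can refer to columns created earlier in the same call, provided they are given as callables.

```python
import pandas as pd

# Made-up sample data; only the column names come from the dplyr example.
flights = pd.DataFrame({
    "arr_delay": [11.0, 20.0, 33.0],
    "dep_delay": [2.0, 4.0, 2.0],
    "air_time": [120.0, 60.0, 30.0],  # minutes
})

# Each callable receives the frame as built so far, so gain_per_hour
# can use the gain column created one keyword earlier.
result = flights.assign(
    gain=lambda d: d.arr_delay - d.dep_delay,
    gain_per_hour=lambda d: d.gain / (d.air_time / 60),
)

print(result)
```

The original `flights` frame is left unchanged; `assign` returns a new DataFrame.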
I really like this idea. @TomAugspurger I'm not sure what to make of this example:
In particular, are you suggesting that Note that @mrocklin added similar dplyr like syntax to blaze: blaze/blaze#484 This is definitely a place where Python's syntax (and limited magic, which is usually a good thing) makes things a little trickier than in R. |
FWIW, the semantics of Transform does not allow one new column in a mutate to depend on another new column in mutate. It also doesn't have the word |
I had the same question as @shoyer: Further, I was also thinking this looks very much like
There are some things to work out, but I also certainly think such a feature would be a nice addition! I am not fully sure on the |
Sorry about the tricky example, but I guess it's a good one since it exposes a difficulty. In my head I was thinking that it should refer to just the ones that meet the query of > 4.5, though obviously that's not how I wrote it. Maybe we have to go down the path of
but I was wanting to avoid that. |
The following is something like what Blaze does. It's less dplyr-ish and not as well chained, but there are things that one just can't do in Python without macros
|
A bit of a summary,
I think @mrocklin's example in his last post strikes a good balance. I'll put together a PR. |
I was just talking about this with my co-worker and came up with another idea to try to replace R's macros. What about doing an automatic For example: With some black magic, we could even extend this to other methods to make something very dplyr-like:
(iris[lambda: sepal_length > 4.5]
 .mutate(ratio = lambda: sepal_length / sepal_width)
 .groupby(lambda: pd.cut(ratio))
 .apply(lambda: ratio - ratio.mean()))
I'm not entirely sure this is a good idea! But it does make me less jealous of R users :). Personally, I don't find the groupby |
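To make the "black magic" concrete, here is one way an argument-less lambda could be evaluated against a DataFrame. This is only a sketch, not the implementation from this comment or from any package: `eval_thunk` is a hypothetical helper, and the trick assumes the column names inside the lambda are compiled as global lookups, so the function can be rebuilt with a namespace in which the columns shadow the globals.

```python
import types

import pandas as pd


def eval_thunk(df, thunk):
    """Call an argument-less function with df's columns visible as names.

    Hypothetical helper for illustration only. Columns shadow any global
    of the same name, so this is not hygienic.
    """
    namespace = dict(thunk.__globals__)
    namespace.update({col: df[col] for col in df.columns})
    # Rebuild the function around the same code object but with the
    # augmented global namespace, then call it.
    rebound = types.FunctionType(
        thunk.__code__,
        namespace,
        thunk.__name__,
        thunk.__defaults__,
        thunk.__closure__,
    )
    return rebound()


iris = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 4.4],
    "sepal_width": [3.5, 3.0, 2.9],
})

mask = eval_thunk(iris, lambda: sepal_length > 4.5)
ratio = eval_thunk(iris, lambda: sepal_length / sepal_width)
print(iris[mask].assign(ratio=ratio[mask]))
```

Supporting the chained syntax shown above would additionally require wrapping `__getitem__`, `groupby`, `apply`, and so on, so that each accepts such thunks.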
@jhorowitz-coursera wrote the very similar |
@shoyer that's kinda-awesome / evil. mailing list discussion about pandas-ply. |
I'm settling on a relatively simple implementation. signature:
In [7]: df.head()
Out[7]:
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
In [8]: (df.query('species == "virginica"')
.transform(sepal_ratio=lambda x: x.sepal_length / x.sepal_width)
.head())
Out[8]:
sepal_length sepal_width petal_length petal_width species \
100 6.3 3.3 6.0 2.5 virginica
101 5.8 2.7 5.1 1.9 virginica
102 7.1 3.0 5.9 2.1 virginica
103 6.3 2.9 5.6 1.8 virginica
104 6.5 3.0 5.8 2.2 virginica
sepal_ratio
100 1.909091
101 2.148148
102 2.366667
103 2.172414
104 2.166667
This way we can handle the query case, where you don't have a reference to the DataFrame being passed in, and it's not too magical (and it's easier to implement). |
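A minimal sketch of how such a kwarg-only method might work, written as a free function since the actual signature is not shown above. The name `mutate` and the small `iris` sample are assumptions for illustration. Callables are evaluated against the frame being built, which is what makes the `query` case work without a reference to the filtered DataFrame.

```python
import pandas as pd


def mutate(df, **kwargs):
    """Return a copy of df with one new column per keyword argument.

    Hypothetical sketch: a callable value is called with the frame
    built so far; any other value is assigned as-is.
    """
    out = df.copy()
    for name, value in kwargs.items():
        out[name] = value(out) if callable(value) else value
    return out


# A few rows in the shape of the iris data shown above.
iris = pd.DataFrame({
    "sepal_length": [5.1, 4.9, 6.3, 5.8],
    "sepal_width": [3.5, 3.0, 3.3, 2.7],
    "species": ["setosa", "setosa", "virginica", "virginica"],
})

# The lambda receives the already-filtered frame, so the chain never
# needs a name for the intermediate result.
result = (iris.query('species == "virginica"')
              .pipe(mutate, sepal_ratio=lambda x: x.sepal_length / x.sepal_width))

print(result)
```

Here `DataFrame.pipe` stands in for a real method so the call still reads as part of the chain.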
Submitted a PR if you want to move the discussion there. |
Just FYI, I think I'm going to put my separate "thunk" based proposal (with argument-less lambdas) into a separate package -- I already have a working proof of concept. The most annoying thing is ensuring that new dataframes remain macro friendly... that requires writing a lot of wrappers (probably unavoidable). Also, I haven't been able to come up with a way to write hygienic macros. Probably not possible in Python. |
Do you think a heavy-handed approach to DSLs would gain traction? I've been experimenting with taking over evaluation completely via import hooks and IPython input transformers: https://github.com/dalejung/naginpy. It's kind of all over the map, as I have multiple use cases in mind and am still figuring out how to support all of them cleanly. I had tried to get what I wanted via just AST transforms and fun stuff like context managers that introduce temporary scopes, but they were always lacking the syntax I wanted. |
@dalejung What does it look like to enable your DSLs in a script? My gut is that full DSLs are probably too painful for me to integrate them into my workflows. |
@shoyer depends I suppose. The only hard requirement is that the Import machinery is installed. I generally run scripts via Haven't decided how I want to signal modules on/off. I use a sentinel value for |
@shoyer @dalejung there is also this: https://github.com/lihaoyi/macropy |
@TomAugspurger @jreback @shoyer Have you guys seen this library? Has an interesting take on the situation https://github.com/coursera/pandas-ply |
@datnamer Yes, we have, see my comment above :). |
@shoyer oops missed that, sorry for the noise |
Creates a new method for DataFrame, based off dplyr's mutate. Closes pandas-dev#9229
In my notebook comparing dplyr and pandas, I gained a new level of appreciation for the ability to chain strings of operations together. In my own code, the biggest impediment to this is adding additional columns that are calculations on existing columns. For example
vs.
just doesn't flow as nicely, especially if this mutate is in the middle of a chain.
I'd propose a new method (perhaps stealing mutate) that's similar to dplyr's.
The function signature could be kwarg only, where the keywords are the new column names, e.g.
This would return a DataFrame with the new column gain in addition to the original columns.
Worked out example
Thoughts?