Deprecation of relabeling dicts in groupby.agg brings many issues #18366
I agree that renaming with the current I suppose there could be a So the syntax would turn into:
Or maybe, an entire new method
I actually like this option much better. |
For what it's worth, I am also strongly in favour of not deprecating the functionality. A big reason for me is that there is something deeply strange about mixing Python's function namespace (something to do with the particular implementation) with the data's column names (something that should surely not know about the implementation). The fact that we end up seeing columns (potentially multiple columns) named `<lambda>` speaks for itself. The renaming approach grates because there is an intermediary step where unnecessary (and exposed) column names are carried around. Furthermore, they are difficult to rename reliably and systematically, because there are potentially dependencies on the implementation. Aside from that, the nested-dict functionality is admittedly complex, but it is a complex operation that is being performed. TL;DR: Please don't deprecate. :) |
My contribution is motivated by two things.
Also, the Pandas Series and DataFrame objects have had I'll use @zertrin 's example
Which results in
The |
I really like tdpetrou's idea - to use: This can make everyone happy. It gives us a possibility to rename columns easily as we wish. |
Not really; as mentioned in my initial post, this would not solve the issue of decoupling the place where the aggregate operation is defined from the name of the resulting column, requiring extra effort to make sure both are "synchronized". I'm not saying that's a bad solution (after all, it solves the other issues), but it wouldn't be as easy and clear as the dict-of-dict approach. I mean that at writing time you need to keep both dicts of lists synchronized, and when reading the source, the reader must make an effort to match the names in the second dict of lists with the aggregate definitions in the first dict of lists. That's twice the effort in each case.
I still don't understand why everyone seems to say that dict of dict is complex. To me that's the clearest way of doing it. That said, if the |
@pirsquared: an interesting solution with the current API, albeit in my opinion not quite easy to grasp (I don't really understand how it works 😕). |
I started a thread on the datascience subreddit: What do you hate about pandas?. Someone brought up their contempt for the MultiIndex returned after a groupby aggregation. Any thoughts, @zertrin @jreback @TomAugspurger? |
Although this solves the problem, one needs to align keys and values in two places. I think an API as suggested above would be better. There is also the issue of the clean-up code after using the API. Based on usage patterns, I think multi-index outputs should be strictly opt-in and not opt-out. |
I was initially skeptical about this at first. Especially thinking about the possibility to use it in the form above:

That would be quite flexible and solve all the issues mentioned in my OP. |
I was thinking about this a bit more, and this functionality already exists with `apply` (with a function that returns a Series):

So there might not be a need for a new method, and instead just a better example in the docs. |
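If the pattern meant here is `groupby(...).apply(...)` with a function that returns a `pd.Series` (how later comments describe it), here is a minimal sketch with made-up data and a hypothetical `summarize` helper; the Series index becomes the output column names:

```python
import pandas as pd

df = pd.DataFrame({'cat': ['A', 'A', 'B'],
                   'energy': [1.8, 1.95, 1.25],
                   'distance': [1.2, 1.5, 0.82]})

def summarize(g):
    # hypothetical helper: each key becomes an output column
    return pd.Series({'total_energy': g['energy'].sum(),
                      'max_distance': g['distance'].max()})

res = df.groupby('cat').apply(summarize)
# one row per group, columns named exactly as we chose
```

This gives full control over the result's column names, at the cost of writing a throwaway function.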
@tdpetrou, you are correct. I had forgotten how `apply` works here. |
Hmm, indeed, there is no chance I would have thought about using it in an aggregation context just by reading the doc, however... Since there was never really a statement about it, is this guaranteed behaviour? Except for the |
The benefit of this approach is that the aggregating functions receive the entire group, so they can combine columns. The drawback is that your nice, generic aggregation functions now have to be concerned with column selection. There's no free lunch! That means you'll end up with many small helpers. I'm not sure about these tradeoffs vs. the dict-of-dicts method, but I'm curious to hear other people's thoughts. Here's a rough sketch of an `agg_table` method:

```python
from collections import defaultdict

import pandas as pd
import numpy as np
from pandas.core.groupby import DataFrameGroupBy

mydf = pd.DataFrame(
    {
        'cat': ['A', 'A', 'A', 'B', 'B', 'C'],
        'energy': [1.8, 1.95, 2.04, 1.25, 1.6, 1.01],
        'distance': [1.2, 1.5, 1.74, 0.82, 1.01, 0.6]
    },
    index=range(6)
)

def agg_table(self, **kwargs):
    output = defaultdict(dict)
    for group in self.groups:
        for k, v in kwargs.items():
            output[k][group] = v(self.get_group(group))
    return pd.concat([pd.Series(output[k]) for k in output],
                     keys=list(output),
                     axis=1)

DataFrameGroupBy.agg_table = agg_table
```

Usage:

```python
>>> gr = mydf.groupby("cat")
>>> gr.agg_table(n=len,
...              foo=lambda x: x.energy.min(),
...              bar=lambda y: y.distance.min())
   n   foo   bar
A  3  1.80  1.20
B  2  1.25  0.82
C  1  1.01  0.60
```

I suspect we could do a bit to make the performance of this less awful, but not nearly as much as with the specialized `agg` paths. |
Could someone from the pandas core team please explain the main reason for deprecating relabeling dicts in `groupby.agg`? I could easily understand it if it causes too many problems to maintain the code, but if it's about complexity for the end user, I would also opt for bringing it back, as it's pretty clear compared to the needed workarounds... Thank you! |
Did you see https://github.com/pandas-dev/pandas/pull/15931/files#diff-52364fb643114f3349390ad6bcf24d8fR461? The primary reason was that dict keys were overloaded to do two things. For Series / SeriesGroupBy, they're for naming. For DataFrame / DataFrameGroupBy, they're for selecting a column.

```python
In [32]: mydf.aggregate({"distance": "min"})
Out[32]:
distance    0.6
dtype: float64

In [33]: mydf.aggregate({"distance": {"foo": "min"}})
/Users/taugspurger/Envs/pandas-dev/bin/ipython:1: FutureWarning: using a dict with renaming is deprecated and will be removed in a future version
  #!/Users/taugspurger/Envs/pandas-dev/bin/python3.6
Out[33]:
     distance
foo       0.6

In [34]: mydf.distance.agg({"foo": "min"})
Out[34]:
foo    0.6
Name: distance, dtype: float64

In [35]: mydf.groupby("cat").agg({"distance": {"foo": "min"}})
/Users/taugspurger/Envs/pandas-dev/lib/python3.6/site-packages/pandas/pandas/core/groupby.py:4201: FutureWarning: using a dict with renaming is deprecated and will be removed in a future version
  return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)
Out[35]:
    distance
         foo
cat
A       1.20
B       0.82
C       0.60

In [36]: mydf.groupby("cat").distance.agg({"foo": "min"})
/Users/taugspurger/Envs/pandas-dev/bin/ipython:1: FutureWarning: using a dict on a Series for aggregation
is deprecated and will be removed in a future version
  #!/Users/taugspurger/Envs/pandas-dev/bin/python3.6
Out[36]:
     foo
cat
A   1.20
B   0.82
C   0.60
```

This isn't the most confusing thing in pandas probably, so perhaps we could revisit it :) I'm presumably missing some edge cases. But even if we do remove dict-of-dicts aggregations, we still have the inconsistency between naming and column selection: for Series / SeriesGroupBy, the dictionary keys are always for naming the output; for DataFrame / DataFrameGroupBy, the dict keys are always for selection. With dict-of-dicts we select a column, and then the inner dict is for naming the output, just like Series / SeriesGroupBy. |
We discussed this briefly before (somewhere in the long discussion about the deprecation), and I proposed something similar here: #14668 (comment). But in the end only the deprecation was implemented, and not the ideas for making the other functionality of using dicts (the 'renaming' function) easier. The problem was that dicts were used both for 'selection' (to which column do you want this function applied) and for 'renaming' (what should the resulting column name be when applying this function). An alternative syntax, apart from dicts, could be keyword arguments, as is discussed here in the thread. What I proposed back then was something similar to:

I am not sure this is necessarily more readable or easier to write than the version with all lambdas, but this one could potentially be more performant, as pandas can still use the optimized implementations for sum, mean, etc. on those columns where you do not have a lambda or user-specified function. A big question with this approach would be what to do with the existing dict behaviour. I think the above can be done backwards-compatibly inside the existing `agg`:

(and you could even consider doing the above once for 'distance' and once for 'energy' and concatenating the results if you don't like all the dicts / lambdas)

@TomAugspurger, a question about your example implementation of `agg_table` above. |
BTW, @zertrin @tdpetrou @smcateer @pirsquared and others, thanks a lot for raising this issue and giving such detailed feedback. Such feedback and community involvement is very important! |
I actually really like the pattern suggested by @tdpetrou (using apply with a function that returns a Series) - probably even better than the dict of dicts. If the function returns Edit: sorry, I misunderstood the point of the index argument (it is optional here, only needed if you want to specify the order of the columns - returning |
Would @tdpetrou's example work here? I had to resort to head/tail like this:

```python
def agg_funcs(x):
    data = {'start': x['DATE_TIME'].head(1).values[0],
            'finish': x['DATE_TIME'].tail(1).values[0],
            'events': len(x['DATE_TIME'])}
    return pd.Series(data, index=list(data.keys()))

results = df.groupby('col').apply(agg_funcs)
```
|
I'd still like to address this, but I don't think it'll be done for 0.23. |
Could @tdpetrou's approach work without defining a function that we will never use again in our code? Coming from a Q/Kdb+ world (similar to SQL), I am confused why we need to create any temporary variable/function for a simple select statement. |
OP here. Honestly, after all this time and the plenty of discussion in #15931 and here, I am still not convinced that deprecating relabeling dicts is a good idea. In the end, none of the alternatives proposed here are more intuitive to users than the current relabeling-dict approach, IMHO. When it was in the documentation, just one example made it clear how this works, and it is very flexible. Of course pandas developers may still think otherwise; just chiming in with the point of view of a user. |
@jorisvandenbossche I have considered your approach. The thing is, I don't really see what additional advantages it brings over the current approach. To put it more bluntly, given the following choices:
I don't see why we should choose 2 unless it brings meaningful and tangible advantages from a developer and user perspective. To address some of the points in your proposal above:
Since it was nicely documented before, I don't believe it was an issue for users. Personally, I got the point immediately from looking at the examples in the documentation. (EDIT: and I thought "yay! a very useful construct, it exactly matches what I was looking for. Nice.")

One of the attractive things about the dict-of-dict approach is that users can easily generate it dynamically with some other code. As you pointed out in the comment just above this one, moving to keyword arguments as in your proposition would still allow this via the `**kwargs` unpacking construct. In the end, your proposal would be okay if the pandas core devs have a strong feeling against the current approach. I just don't see any benefit as a user. (In fact, I see the drawback of changing all existing user code to make it work again with the new proposition.) |
@zertrin we discussed this yesterday with some core devs but didn't get around to writing the summary here, so now I am going to do that before answering your comment, to reflect only our thoughts of yesterday. First, to state: the notion that basic functionality like the SQL `SELECT avg(col2) AS col2_avg` should work and be easy is something we completely agree on, and we really want to have a solution for it. Apart from the original reasons we decided to deprecate this (which may or may not be that strong), the current (deprecated) dict-of-dicts is also not ideal, as it creates a MultiIndex that you actually never want:
In the above, the first level of the MultiIndex is superfluous, as you already specifically renamed the columns (in the example in the OP, this is also directly followed by dropping the first level of the columns). So one of the proposals mentioned in the discussion above was to have a way to specify the final column names separately (e.g. #18366 (comment)), so that something like the following
would be possible.
This needs #14829 to be solved (something we want to do for 0.24.0). Then, we still like the keyword-argument way of renaming. Reasons for this are:
We still had a bit of discussion about how it could look. What I proposed above was (using the equivalent column/function selection as in my first examples):
You can still build up this specification as a dict of dicts, but with the inner and outer levels swapped compared to the current (deprecated) version:

(we could have an example helper function that converts existing dicts of dicts to this version) However, the dict is always only a single
This looks a bit better, but on the other hand it is more verbose.

Both suggestions above (the easier renaming afterwards, and the keyword-based naming) are in principle orthogonal, but it could be nice to have both (or still something else based on further discussion). |
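For reference, the keyword-plus-tuple flavour discussed here is essentially what later shipped as "named aggregation" in pandas 0.25; a minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'cat': ['A', 'A', 'B'],
                   'energy': [1.0, 2.0, 4.0],
                   'distance': [1.2, 1.5, 0.8]})

# each keyword names an output column; the value is a (column, aggfunc) tuple
# (requires pandas >= 0.25)
out = df.groupby('cat').agg(total_energy=('energy', 'sum'),
                            avg_distance=('distance', 'mean'))
```

The result has flat, user-chosen column names and no superfluous MultiIndex level.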
Thanks for forwarding the devs' current thoughts here 😃

I acknowledge the (in my opinion, only) drawback of the deprecated dict-of-dict approach: the resulting MultiIndex. It could be flattened if the user passes an additional option (yeah, YAO :-/ ). As mentioned, I am not against the second version, as long as it stays possible to:

As such, the last suggestion (with dicts or tuples for the col>func mapping) is okay, I think. The first proposition in the previous comment can be implemented if you really want to, but my feedback on this is that, as a user, I wouldn't choose to use it over the second alternative, because of the pain of keeping things in sync between two lists. |
Discussed at the dev meeting today. Short summary
|
Maybe this helps: my workaround for the deprecation is these helper functions, which replace the alias->aggr maps with lists of correctly named functions:

```python
def aliased_aggr(aggr, name):
    if isinstance(aggr, str):
        def f(data):
            return data.agg(aggr)
    else:
        def f(data):
            return aggr(data)
    f.__name__ = name
    return f

def convert_aggr_spec(aggr_spec):
    return {
        col: [
            aliased_aggr(aggr, alias) for alias, aggr in aggr_map.items()
        ]
        for col, aggr_map in aggr_spec.items()
    }
```

which gives the old behaviour with:

which is the same as

```python
mydf_agg = mydf.groupby('cat').agg({
    'energy': [
        aliased_aggr('sum', 'total_energy'),
        aliased_aggr(lambda x: np.percentile(x, 98), 'energy_p98'),
        aliased_aggr(lambda x: np.percentile(x, 17), 'energy_p17')
    ],
    'distance': [
        aliased_aggr('sum', 'total_distance'),
        aliased_aggr('mean', 'average_distance'),
        aliased_aggr(smrb.mad, 'distance_mad'),
        aliased_aggr(mad_c1, 'distance_mad_c1'),
    ]
})
```

This works for me, but probably won't work in some corner cases...

Update: I found out that renaming is not necessary, as tuples in an aggregation specification are interpreted as (alias, aggr). So the `aliased_aggr` function is not needed, and the conversion becomes:

```python
def convert_aggr_spec(aggr_spec):
    return {
        col: [
            (alias, aggr) for alias, aggr in aggr_map.items()
        ]
        for col, aggr_map in aggr_spec.items()
    }
```
|
I just want to chime in here as yet another user who is really, really missing the functionality of aggregating a column with any function and immediately renaming it in the same row. I have never found myself using the MultiIndex returned by pandas; I either flatten it immediately, or I actually want to manually specify my column names because they actually mean something specific.

I would be happy with any of the approaches proposed here: SQL-like syntax (I actually find myself using `.query()` a lot in pandas already), reverting to the deprecated behavior, any of the other suggestions. The current approach already brought me ridicule from colleagues who use R.

I recently even found myself using PySpark instead of pandas even though it wasn't necessary, just because I like the syntax so much more:

```python
df.groupby("whatever").agg(
    F.max("col1").alias("my_max_col"),
    F.avg("age_col").alias("average_age"),
    F.sum("col2").alias("total_yearly_payments")
)
```

Also, PySpark is way more convoluted to write than pandas in most cases; this just looks so much cleaner! So I definitely appreciate that work on this is still being done :-) |
I think we have an agreed syntax for this functionality; we need someone to implement it. |
I'm going to try to get to this for 0.25.0 |
I've put up a PR at #26399. The basic idea is to allow this mixture of rename & column-specific aggregation by using keyword arguments whose values are (column, aggfunc) tuples:

```python
In [2]: df = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
   ...:                    'height': [9.1, 6.0, 9.5, 34.0],
   ...:                    'weight': [7.9, 7.5, 9.9, 198.0]})

In [3]: df
Out[3]:
  kind  height  weight
0  cat     9.1     7.9
1  dog     6.0     7.5
2  cat     9.5     9.9
3  dog    34.0   198.0

In [4]: df.groupby('kind').agg(min_height=('height', 'min'), max_weight=('weight', 'max'))
Out[4]:
      min_height  max_weight
kind
cat          9.1         9.9
dog          6.0       198.0
```

This has a few limitations. And there's an implementation detail: multiple lambda aggfuncs for the same column aren't supported for now.

I suspect that most people subscribed here would be supportive of some alternative to the deprecated behavior. What do people think of this one specifically? cc @WillAyd if I missed any of your concerns. |
Hi @TomAugspurger, Thanks for making this move forward.
I can't help but feel that this kind of argument seems quite similar to the one that motivated deprecating the existing implementation in the first place. Could you share why we benefit more from this new way over the old one with respect to that particular argument? One benefit I could already think of is that (for py3.6+) we can select the output order of the columns individually.
Somehow, the old way was better in that respect. But as I said earlier, as long as it is possible to use the `**` unpacking construct, that's fine.
How did it work before (with the existing dict-of-dict feature)? Was the order guaranteed in some way?

Just to confirm my understanding: the aggfunc can be any callable that returns a valid value, right? (In addition to the "often used" string aggfuncs like 'min' or 'mean'.)
Yeah that one is kind of annoying, but as long as it is just a temporary limitation and it is open to fix this, that could work.
Well, in any case I think that aggregate and rename in one step is really important to keep. If the old behavior is really not an option, then this alternative could do. |
I may be misremembering, but I believe SeriesGroupBy.agg and DataFrameGroupBy.agg have different meanings for the outer key in a dictionary (is it a column selection or an output naming?). With this syntax, we can consistently have the keyword mean the output name.
Is the difference just the
Sorting the keys, which is what I'm doing over in the PR now.
Here's the difference:

```python
In [21]: df = pd.DataFrame({"A": ['a', 'a'], 'B': [1, 2], 'C': [3, 4]})

In [22]: def aggfunc(x, myarg=None):
    ...:     print(myarg)
    ...:     return sum(x)
    ...:

In [23]: df.groupby("A").agg({'B': {'foo': aggfunc}}, myarg='bar')
/Users/taugspurger/sandbox/pandas/pandas/core/groupby/generic.py:1308: FutureWarning: using a dict with renaming is deprecated and will be removed in a future version
  return super().aggregate(arg, *args, **kwargs)
None
Out[23]:
     B
   foo
A
a    3
```

with the alternative proposal, we're reserving `**kwargs` for naming the output columns. |
Ok thanks, I think the proposed approach is 👍 for a first iteration (and will really be OK as a replacement as soon as the multiple-lambda limitation is removed). |
This issue is created based on the discussion in #15931 following the deprecation of relabeling dicts in `groupby.agg`. A lot of what is summarized below was already discussed there; I would recommend in particular #15931 (comment), where the problems are also clearly stated.

The motivation behind the deprecation in #15931 was mostly about bringing a consistent interface for `agg()` between Series and DataFrame (see also #14668 for context).

The relabeling functionality with a nested dict was described by some as being too complex and/or inconsistent, and was thus deprecated.
However, this comes at a price: the impossibility to aggregate and rename at the same time leads to very annoying issues and some backward incompatibility where no sensible workaround is available, short of overwriting the function's `__name__` attribute.

Example
(Please note: this is a crafted example for the purpose of demonstrating the problem in as little code as possible, but all of the issues demonstrated here did bite me in real life since the change, and in situations not as simple as here.)

Input DataFrame

Before:

easy to write and read, and works as expected

results in

and all that is left is:

Happy dance praising pandas 💃 🕺 !
After

The above breaks because the lambda functions all result in columns named `<lambda>`, which results in

Backward-incompatible regression: one cannot apply two different lambdas to the same original column anymore.

If one removes the `lambda x: np.percentile(x, 98)` from above, we get the same issue with the partial function, which inherits its name from the original function:

Finally, after overwriting the `__name__` attribute of the partial (for example with `mad_c1.__name__ = 'mad_c1'`) we get:

with still

to deal with in a separate step.
There is no control possible over the column names after aggregation; the best we can get in an automated way is some combination of the original column name and the aggregate function's name, like this:

which results in:

and if you really need different names, you can do it like this:

but that means you need to be careful to keep the renaming code (which must now live at another place in the code) in sync with the code where the aggregation is defined...
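The automated-combination workaround described above can be sketched like this (toy frame; the underscore-joined names are just one possible convention):

```python
import pandas as pd

df = pd.DataFrame({'cat': ['A', 'A', 'B'],
                   'energy': [1.0, 2.0, 4.0]})

out = df.groupby('cat').agg({'energy': ['sum', 'mean']})
# columns come back as a MultiIndex: ('energy', 'sum'), ('energy', 'mean')
out.columns = ['_'.join(col) for col in out.columns]
```

This flattens the MultiIndex, but the names are still derived from the function names rather than chosen by the user.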
Sad pandas user 😢 (who still loves pandas, of course)
I am all in for consistency, and at the same time I deeply regret the deprecation of the aggregate and rename functionality. I hope the examples above make the pain points clear.
Possible solutions
Optional read:
With respect to the aforementioned discussion in the pull request, which has been going on for a few months already, I only recently realized one of the reasons why I am so bothered by this deprecation: "aggregate and rename" is a natural thing to do with GROUP BY aggregations in SQL, since in SQL you usually provide the destination column name directly next to the aggregation expression, e.g. `SELECT col1, avg(col2) AS col2_mean, stddev(col2) AS col2_var FROM mytable GROUP BY col1`.

I'm not saying that pandas should necessarily provide the same functionalities as SQL, of course. But the examples provided above demonstrate why the dict-of-dict API was, in my opinion, a clean and simple solution to many use-cases.
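As a point of comparison, that SQL statement maps onto pandas named aggregation (0.25+) roughly like this (toy data; pandas `std` standing in for SQL `stddev`):

```python
import pandas as pd

df = pd.DataFrame({'col1': ['x', 'x', 'y'],
                   'col2': [1.0, 3.0, 5.0]})

# SELECT col1, avg(col2) AS col2_mean, stddev(col2) AS col2_var
# FROM mytable GROUP BY col1
out = df.groupby('col1').agg(col2_mean=('col2', 'mean'),
                             col2_var=('col2', 'std'))
```

The destination name sits directly next to the aggregation, just as in the SQL form.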
(* I don't personally agree that the dict-of-dict approach is complex.)