BUG: #34380
Yes, agreed. Instead of producing wrong results, it should raise an error.
Minimum reproducible example:
df = pd.DataFrame({'token': [12345.0, 12345.0, 12345.0, 12345.0, 12345.0, 12345.0, 6789.0, 6789.0, 6789.0, 6789.0, 6789.0, 6789.0],
'name': ['abc', 'abc', 'abc', 'abc', 'abc', 'abc', 'xyz', 'xyz', 'xyz', 'xyz', 'xyz', 'xyz'],
'ltp': [2.0, 5.0, 3.0, 9.0, 5.0, 16.0, 1.0, 5.0, 3.0, 13.0, 9.0, 20.0],
'change': [np.nan, 1.5, -0.4, 2.0, -0.44444399999999995, 2.2, np.nan, 4.0, -0.4, 3.333333, -0.307692, 1.222222]})
#with only 1 named aggregation :
df.groupby('name')['change'].agg(pos = pd.NamedAgg(column='change',aggfunc=lambda x:x.gt(0).sum()))
#Output
pos
name
abc 3.0
xyz 3.0
#with 2 named aggregations on the same column
df.groupby('name')['change'].agg(pos = pd.NamedAgg(column='change',aggfunc=lambda x:x.gt(0).sum()),\
sum = pd.NamedAgg(column='change',aggfunc=lambda x:x.sum()))
#Output: the values of the pos column are overwritten by the sum column's values
pos sum
name
abc 4.855556 4.855556 #The values of `pos` column are over-written by `sum` column's values
xyz 7.847863 7.847863
# with 3 named aggregations on the same column, the values of previous columns are replaced by the latest aggregation.
df.groupby('name')['change'].agg(pos = pd.NamedAgg(column='change',aggfunc=lambda x:x.gt(0).sum()),\
sum = pd.NamedAgg(column='change',aggfunc=lambda x:x.sum()),\
max = pd.NamedAgg(column='change',aggfunc='max'))
pos sum max
name
abc 2.2 2.2 2.2 #The values of `pos`, `sum` are over-written by `max` column's values.
xyz 4.0 4.0 4.0
pd.__version__
# '1.0.3'
INSTALLED VERSIONS
commit : None
pandas : 1.0.3
Link to relevant SO post
I guess the problem is when using with
Minimum reproducible example:
s = pd.Series([1,1,2,2,3,3,4,5])
s.groupby(s.values).agg(one = pd.NamedAgg(column='new',aggfunc='sum'))
#Output
one
1 2
2 4
3 6
4 4
5 5
I expected it to raise
The values of the previous column are over-written when the same column name is specified in
s.groupby(s.values).agg(one=pd.NamedAgg(column='weird',aggfunc='sum'),\
second=pd.NamedAgg(column='weird',aggfunc='max'))
#Output
one second # Values of column `one` are over-written
1 1 1
2 2 2
3 3 3
4 4 4
5 5
When different column names are provided, it doesn't over-write.
s.groupby(s.values).agg(one=pd.NamedAgg(column='anything',aggfunc='sum'),\
second=pd.NamedAgg(column='something',aggfunc='max'))
#Output
one second #Values are not over written
1 2 1
2 4 2
3 6 3
4 4 4
5 5
Previous aggregation values are overwritten when the same column name is fed to
s.groupby(s.values).agg(one=pd.NamedAgg(column='weird',aggfunc='sum'),\
second=pd.NamedAgg(column='diff',aggfunc='max'),
third = pd.NamedAgg(column='weird',aggfunc=lambda x: ','.join(map(str,x))))
#Output
one second third
1 1,1 1 1,1
2 2,2 2 2,2
3 3,3 3 3,3
4 4 4 4
5 5 5
The aggregation values of
Link to relevant SO post
Many thanks for the report
I'll test this out this evening, but I'd like to think it'd be solved by this PR: #30858 - will update later when I've tried
@MarcoGorelli Thank you for the reply. If the test results indicate there's a bug, I'll be more than glad to work on a PR.
@gurukiran07 I just checked out that PR, tried your minimal reproducible example above, and got
In [4]: df = pd.DataFrame({'token': [12345.0, 12345.0, 12345.0, 12345.0, 12345.0, 12345.0, 6789.0, 6789.0, 6789.0, 6789.0, 6789.0, 6789.0],
...: 'name': ['abc', 'abc', 'abc', 'abc', 'abc', 'abc', 'xyz', 'xyz', 'xyz', 'xyz', 'xyz', 'xyz'],
...: 'ltp': [2.0, 5.0, 3.0, 9.0, 5.0, 16.0, 1.0, 5.0, 3.0, 13.0, 9.0, 20.0],
...: 'change': [np.nan, 1.5, -0.4, 2.0, -0.44444399999999995, 2.2, np.nan, 4.0, -0.4, 3.333333, -0.307692, 1.222222]})
...:
...: #with only 1 named aggregation :
...: df.groupby('name')['change'].agg(pos = pd.NamedAgg(column='change',aggfunc=lambda x:x.gt(0).sum()))
...: #Output
Out[4]:
pos
name
abc 3.0
xyz 3.0
In [5]: df.groupby('name')['change'].agg(pos = pd.NamedAgg(column='change',aggfunc=lambda x:x.gt(0).sum()),\
...: sum = pd.NamedAgg(column='change',aggfunc=lambda x:x.sum()))
...:
Out[5]:
pos sum
name
abc 3.0 4.855556
xyz 3.0 7.847863
In [6]:
...: df.groupby('name')['change'].agg(pos = pd.NamedAgg(column='change',aggfunc=lambda x:x.gt(0).sum()),\
...: sum = pd.NamedAgg(column='change',aggfunc=lambda x:x.sum()),\
...: max = pd.NamedAgg(column='change',aggfunc='max'))
...:
Out[6]:
pos sum max
name
abc 3.0 4.855556 2.2
xyz 3.0 7.847863 4.0
so it seems it's fixed. I'll add a test case to explicitly use
@MarcoGorelli Good to know it's fixed. One more thing I would like to suggest.
Minimum reproducible example:
s = pd.Series([1,1,2,2,3,3,4,5])
s.groupby(s.values).agg(one=pd.NamedAgg(column='anything',aggfunc='sum'),\
second=pd.NamedAgg(column='something',aggfunc='max'))
#Output
one second
1 2 1
2 4 2
3 6 3
4 4 4
5 5 5
Expected Output:
KeyError : "Column 'anything' does not exist!"
Or
TypeError : got an unknown keyword column
It works with any name given to
@gurukiran07 TBH it's not clear to me what the user should pass as
@MarcoGorelli That is what I was trying to convey; you understood correctly. Disallowing
Something like this
s.groupby(s.values).agg(pos = pd.NamedAgg(aggfunc=lambda x :x.sum()))
Currently it raises
We document the behavior for Series in https://pandas.pydata.org/docs/user_guide/groupby.html#named-aggregation
So for SeriesGroupBy.agg, if there are any tuples or
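For reference, a minimal sketch of the Series form described in the linked user guide section: since the column is already selected, each keyword maps an output name straight to a function, with no `pd.NamedAgg` or `column=` involved. The frame is a trimmed copy of the one in the report, and the output names `total` and `largest` are picked here purely for illustration.

```python
import numpy as np
import pandas as pd

# Trimmed copy of the report's data: only the columns the example needs.
df = pd.DataFrame({
    "name": ["abc"] * 6 + ["xyz"] * 6,
    "change": [np.nan, 1.5, -0.4, 2.0, -0.444444, 2.2,
               np.nan, 4.0, -0.4, 3.333333, -0.307692, 1.222222],
})

# The column is selected with ['change'], so each keyword is simply
# output_name=aggregation_function.
result = df.groupby("name")["change"].agg(
    pos=lambda x: x.gt(0).sum(),  # count of positive changes per group
    total="sum",
    largest="max",
)
print(result)
```

Each output column keeps its own aggregation here, which is the behavior the report expected.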
@TomAugspurger Thank you for the reply. Yes, raising an error would be much better and consistent for
Changing how agg of SeriesGroupBy works can be helpful without disturbing
NamedAgg is just a namedtuple. There’s nothing to be done there. The problem is solely in SeriesGroupBy
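A quick way to see this, as a small sketch that only inspects the object and makes no claim about how `agg` consumes it:

```python
import pandas as pd

agg = pd.NamedAgg(column="change", aggfunc="max")

# NamedAgg carries no behavior of its own: it is a tuple with two named fields.
print(isinstance(agg, tuple))
print(agg.column, agg.aggfunc)
```

Because it is an ordinary tuple, any check for it has to live in the code that receives it.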
@gurukiran07 as this would be separate, if you're interested in working on it, could you open a new issue for it?
@TomAugspurger and @MarcoGorelli Yes, I would be more than glad to work on it. I identified where the changes can be made so that an error can be raised when
Adding some additional constraints in groupby/generic.py can solve the issue. I'll open a new issue, start working on it, and hopefully open a PR. Thank you.
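To make the idea concrete, here is a rough sketch of the kind of constraint being discussed. It is not the pandas implementation: `check_series_agg_kwargs` is a hypothetical helper, the error wording is invented, and a stand-in `NamedAgg` namedtuple is used so the snippet runs without pandas.

```python
from collections import namedtuple

# Stand-in for pd.NamedAgg (assumption: same two fields), so this runs on its own.
NamedAgg = namedtuple("NamedAgg", ["column", "aggfunc"])


def check_series_agg_kwargs(kwargs):
    """Hypothetical validation for Series named aggregation.

    Accepts only callables or strings as values; rejects tuples such as
    NamedAgg, because a Series has no column to select.
    """
    for name, spec in kwargs.items():
        if isinstance(spec, tuple) or not (callable(spec) or isinstance(spec, str)):
            raise TypeError(
                f"{name!r}: expected a function or function name, "
                f"got {type(spec).__name__}"
            )
    # Output names and the functions to apply, in keyword order.
    return list(kwargs), list(kwargs.values())


print(check_series_agg_kwargs({"one": "sum", "second": max}))

try:
    check_series_agg_kwargs({"one": NamedAgg(column="weird", aggfunc="sum")})
except TypeError as exc:
    print("rejected:", exc)
```

With a check like this, the `column='weird'` examples above would fail loudly instead of silently returning overwritten values.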
I think this was fixed by #34435
From SO:
After specifying a column after groupby, the output is incorrect.
Correct using named aggregation:
Wrong output:
Correct using column after groupby:
In my opinion, instead of wrong output it should raise an error/warning.
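The original snippets are not reproduced here, so as a hedged illustration of the form that selects the column inside the aggregation, here is a minimal sketch on a trimmed copy of the data from the comments. `pd.NamedAgg(column=..., aggfunc=...)` is used on the DataFrame groupby, where `column` is meaningful; the output names `total` and `largest` are chosen for this example only.

```python
import numpy as np
import pandas as pd

# Trimmed copy of the data used in the comments.
df = pd.DataFrame({
    "name": ["abc"] * 6 + ["xyz"] * 6,
    "change": [np.nan, 1.5, -0.4, 2.0, -0.444444, 2.2,
               np.nan, 4.0, -0.4, 3.333333, -0.307692, 1.222222],
})

# No column is selected after groupby; each NamedAgg names the column itself.
result = df.groupby("name").agg(
    pos=pd.NamedAgg(column="change", aggfunc=lambda x: x.gt(0).sum()),
    total=pd.NamedAgg(column="change", aggfunc="sum"),
    largest=pd.NamedAgg(column="change", aggfunc="max"),
)
print(result)
```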
Version of pandas: