BUG: #34380

jesrael · 2020-05-26T06:14:40Z

From SO:

After specified column after groupby output is incorrect.

df = pd.DataFrame({'token': [12345.0, 12345.0, 12345.0, 12345.0, 12345.0, 12345.0, 6789.0, 6789.0, 6789.0, 6789.0, 6789.0, 6789.0], 'name': ['abc', 'abc', 'abc', 'abc', 'abc', 'abc', 'xyz', 'xyz', 'xyz', 'xyz', 'xyz', 'xyz'], 'ltp': [2.0, 5.0, 3.0, 9.0, 5.0, 16.0, 1.0, 5.0, 3.0, 13.0, 9.0, 20.0], 'change': [np.nan, 1.5, -0.4, 2.0, -0.44444399999999995, 2.2, np.nan, 4.0, -0.4, 3.333333, -0.307692, 1.222222]})
print (df)
      token name   ltp    change
0   12345.0  abc   2.0       NaN
1   12345.0  abc   5.0  1.500000
2   12345.0  abc   3.0 -0.400000
3   12345.0  abc   9.0  2.000000
4   12345.0  abc   5.0 -0.444444
5   12345.0  abc  16.0  2.200000
6    6789.0  xyz   1.0       NaN
7    6789.0  xyz   5.0  4.000000
8    6789.0  xyz   3.0 -0.400000
9    6789.0  xyz  13.0  3.333333
10   6789.0  xyz   9.0 -0.307692
11   6789.0  xyz  20.0  1.222222

Correct using named aggregation:

df0 = df.groupby('name').agg(pos=pd.NamedAgg(column='change',aggfunc=lambda x: x.gt(0).sum()),\
                            neg = pd.NamedAgg(column='change',aggfunc=lambda x:x.lt(0).sum()))
print (df0)
      pos  neg
name          
abc   3.0  2.0
xyz   3.0  2.0

Wrong output:

df1 = df.groupby('name')['change'].agg(pos = pd.NamedAgg(column='change',aggfunc=lambda x:x.gt(0).sum()),\
                                 neg = pd.NamedAgg(column='change',aggfunc=lambda x:x.lt(0).sum()))
    
print (df1)
      pos  neg
name          
abc   2.0  2.0
xyz   2.0  2.0

Correct using column after groupby:

df2 = df.groupby('name')['change'].agg(pos = lambda x:x.gt(0).sum(),\
                                      neg = lambda x:x.lt(0).sum())
print (df2)
      pos  neg
name          
abc   3.0  2.0
xyz   3.0  2.0

In my opinion instead wrong output it should raise error/warning.

Version of pandas:

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.7.6.final.0
python-bits      : 64
OS               : Windows
OS-release       : 7
machine          : AMD64
processor        : Intel64 Family 6 Model 60 Stepping 3, GenuineIntel
byteorder        : little
LC_ALL           : None
LANG             : en
LOCALE           : None.None

pandas           : 1.0.1
numpy            : 1.18.1
pytz             : 2019.3
dateutil         : 2.8.1
pip              : 20.0.2
setuptools       : 45.2.0.post20200210
Cython           : 0.29.15
pytest           : 5.3.5
hypothesis       : 5.5.4
sphinx           : 2.4.0
blosc            : None
feather          : None
xlsxwriter       : 1.2.7
lxml.etree       : 4.5.0
html5lib         : 1.0.1
pymysql          : None
psycopg2         : None
jinja2           : 2.11.1
IPython          : 7.12.0
pandas_datareader: None
bs4              : 4.8.2
bottleneck       : 1.3.2
fastparquet      : None
gcsfs            : None
lxml.etree       : 4.5.0
matplotlib       : 3.1.3
numexpr          : 2.7.1
odfpy            : None
openpyxl         : 3.0.3
pandas_gbq       : None
pyarrow          : None
pytables         : None
pytest           : 5.3.5
pyxlsb           : None
s3fs             : None
scipy            : 1.4.1
sqlalchemy       : 1.3.13
tables           : 3.6.1
tabulate         : None
xarray           : None
xlrd             : 1.2.0
xlwt             : 1.3.0
xlsxwriter       : 1.2.7
numba            : 0.48.0
None

The text was updated successfully, but these errors were encountered:

gurukiran07 · 2020-05-26T06:49:35Z

Yes, agreed. Instead of producing wrong results, it should raise a ValueError or TypeError. The values of the previous aggregation are overwritten by the latest aggregated values.

Minimum reproducible example.

df = pd.DataFrame({'token': [12345.0, 12345.0, 12345.0, 12345.0, 12345.0, 12345.0, 6789.0, 6789.0, 6789.0, 6789.0, 6789.0, 6789.0],
                   'name': ['abc', 'abc', 'abc', 'abc', 'abc', 'abc', 'xyz', 'xyz', 'xyz', 'xyz', 'xyz', 'xyz'], 
                   'ltp': [2.0, 5.0, 3.0, 9.0, 5.0, 16.0, 1.0, 5.0, 3.0, 13.0, 9.0, 20.0], 
                   'change': [np.nan, 1.5, -0.4, 2.0, -0.44444399999999995, 2.2, np.nan, 4.0, -0.4, 3.333333, -0.307692, 1.222222]})

#with only 1 named aggregation :
df.groupby('name')['change'].agg(pos = pd.NamedAgg(column='change',aggfunc=lambda x:x.gt(0).sum()))
#Output 
     pos
name
abc   3.0
xyz   3.0

#with 2 named aggregation on the same column
df.groupby('name')['change'].agg(pos = pd.NamedAgg(column='change',aggfunc=lambda x:x.gt(0).sum()),\
                                 sum = pd.NamedAgg(column='change',aggfunc=lambda x:x.sum()))
#Output the values of pos column are overwrittem by sum column values
           pos       sum
name
abc   4.855556  4.855556    #The values of `pos` column are over-written by `sum` column's values
xyz   7.847863  7.847863

# with 3 named  aggregation on the same column, the values of previous columns are replaced by the lastest aggregation.

df.groupby('name')['change'].agg(pos = pd.NamedAgg(column='change',aggfunc=lambda x:x.gt(0).sum()),\
                                 sum = pd.NamedAgg(column='change',aggfunc=lambda x:x.sum()),\
                                 max = pd.NamedAgg(column='change',aggfunc='max'))

      pos  sum  max
name
abc   2.2  2.2  2.2          #The values of `pos`, `sum` are over-written by `max` column's values.
xyz   4.0  4.0  4.0

pd.__version__
# '1.0.3'

pd.show_versions()

INSTALLED VERSIONS

commit : None
python : 3.7.4.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 94 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None

pandas : 1.0.3
numpy : 1.18.3
pytz : 2019.3
dateutil : 2.8.1
pip : 20.1.1
setuptools : 46.1.3
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.13.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.2.1
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : None
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : 1.2.0
xlwt : None
xlsxwriter : None
numba : None

This should raise an error or should be specified in the documentation explicitly about this behaviour, this can be a common pitfall for beginners.

Link to relevant SO post

gurukiran07 · 2020-05-26T08:20:07Z

`More weird result:`

Using the same dataframe from the above-mentioned comment. pd.NamedAgg over Series produces some weird results.

#df is the same as above
df.groupby('name')['change'].agg(pos = pd.NamedAgg(column='change',aggfunc=lambda x:x.gt(0).sum()),\
                                 neg = pd.NamedAgg(column='change',aggfunc=lambda x:x.sum()),\
                                 max = pd.NamedAgg(column='ltp',aggfunc='max'))
# Output
           pos       neg  max
name
abc   4.855556  4.855556  2.2
xyz   7.847863  7.847863  4.0

2.2 is the max value of 'change' column(group key is 'abc')
4.0 is the max value of 'change' column(group key is 'xyz')
I would expect it to raise an error saying KeyError: "Column 'ltp' does not exist!". Instead it produces results but it doesn't affect the previous columns like mentioned in issue and the comment above.

Here I used a completely new column name which doesn't exist.

df.groupby('name')['change'].agg(pos = pd.NamedAgg(column='change',aggfunc=lambda x:x.gt(0).sum()),\
                                 neg = pd.NamedAgg(column='change',aggfunc=lambda x:x.sum()),\
                                 max = pd.NamedAgg(column='Anything',aggfunc='max'))

#Output 
           pos       neg  max
name
abc   4.855556  4.855556  2.2
xyz   7.847863  7.847863  4.0

2.2 is the max value of 'change' column(group key is 'abc')
4.0 is the max value of 'change' column(group key is 'xyz')

Shouldn't it raise an KeyError

gurukiran07 · 2020-05-26T08:51:24Z

I guess the problem is when using with pd.Series

Minimum reproducible example:

s = pd.Series([1,1,2,2,3,3,4,5])
s.groupby(s.values).agg(one = pd.NamedAgg(column='new',aggfunc='sum'))
#Output 
   one
1    2
2    4
3    6
4    4
5    5

I expected it to raise KeyError or ValueError here.

The values of previous column are over-written when same column same is specified in pd.NamedAgg

s.groupby(s.values).agg(one=pd.NamedAgg(column='weird',aggfunc='sum'),\
                        second=pd.NamedAgg(column='weird',aggfunc='max'))
#Output 
  one  second  # Values of column `one` are over-written
1  1       1
2  2       2
3  3       3
4  4       4
5  5       5

When different column names are provided it doesn't over-write.

s.groupby(s.values).agg(one=pd.NamedAgg(column='anything',aggfunc='sum'),\
                        second=pd.NamedAgg(column='something',aggfunc='max'))

#Output 
   one  second       #Values are not over written
1    2       1     
2    4       2
3    6       3
4    4       4
5    5       5

When previous aggregation values are overwritten when the same column name is fed to pd.NamedAgg.

s.groupby(s.values).agg(one=pd.NamedAgg(column='weird',aggfunc='sum'),\
                        second=pd.NamedAgg(column='diff',aggfunc='max'),
                        third = pd.NamedAgg(column='weird',aggfunc=lambda x: ','.join(map(str,x))))

#Output 

    one  second third
1  1,1       1   1,1
2  2,2       2   2,2
3  3,3       3   3,3
4    4       4     4
5    5       5     5

The aggregation values of one are overwritten by latest aggregation values of third as they have same column name specified (pd.NamedAgg(column='weird',...), the aggregation values of second are not affected since it shares different column name in pd.NamedAgg(column='diff',...)

Link to relevant SO post

MarcoGorelli · 2020-05-26T12:57:12Z

Many thanks for the report

The aggregation values of one are overwritten by latest aggregation values of third as they have same column name specified

I'll test this out this evening, but I'd like to think it'd be solved by this PR: #30858 - will update later when I've tried

gurukiran07 · 2020-05-27T06:31:26Z

@MarcoGorelli Thank you for the reply. If the test results indicate there's a bug, I'll be more than glad to work on a PR.

MarcoGorelli · 2020-05-27T07:59:05Z

@gurukiran07 I just checkout at that PR, tried your minimal reproducible exaple above, and got

In [4]: df = pd.DataFrame({'token': [12345.0, 12345.0, 12345.0, 12345.0, 12345.0
   ...: , 12345.0, 6789.0, 6789.0, 6789.0, 6789.0, 6789.0, 6789.0], 
   ...:                    'name': ['abc', 'abc', 'abc', 'abc', 'abc', 'abc', 'x
   ...: yz', 'xyz', 'xyz', 'xyz', 'xyz', 'xyz'],  
   ...:                    'ltp': [2.0, 5.0, 3.0, 9.0, 5.0, 16.0, 1.0, 5.0, 3.0,
   ...:  13.0, 9.0, 20.0],  
   ...:                    'change': [np.nan, 1.5, -0.4, 2.0, -0.444443999999999
   ...: 95, 2.2, np.nan, 4.0, -0.4, 3.333333, -0.307692, 1.222222]}) 
   ...:  
   ...: #with only 1 named aggregation : 
   ...: df.groupby('name')['change'].agg(pos = pd.NamedAgg(column='change',aggfu
   ...: nc=lambda x:x.gt(0).sum())) 
   ...: #Output                                                                 
Out[4]: 
      pos
name     
abc   3.0
xyz   3.0

In [5]: df.groupby('name')['change'].agg(pos = pd.NamedAgg(column='change',aggfu
   ...: nc=lambda x:x.gt(0).sum()),\ 
   ...:                                  sum = pd.NamedAgg(column='change',aggfu
   ...: nc=lambda x:x.sum())) 
   ...:                                                                         
Out[5]: 
      pos       sum
name               
abc   3.0  4.855556
xyz   3.0  7.847863

In [6]:  
   ...: df.groupby('name')['change'].agg(pos = pd.NamedAgg(column='change',aggfu
   ...: nc=lambda x:x.gt(0).sum()),\ 
   ...:                                  sum = pd.NamedAgg(column='change',aggfu
   ...: nc=lambda x:x.sum()),\ 
   ...:                                  max = pd.NamedAgg(column='change',aggfu
   ...: nc='max')) 
   ...:                                                                         
Out[6]: 
      pos       sum  max
name                    
abc   3.0  4.855556  2.2
xyz   3.0  7.847863  4.0

so it seems it's fixed.

I'll add a test case to explicitly use pd.NamedAgg

gurukiran07 · 2020-05-27T08:35:29Z

@MarcoGorelli Good to know it's fixed. One more thing I would like to suggest.

Minimum reproducible example:

s = pd.Series([1,1,2,2,3,3,4,5])
s.groupby(s.values).agg(one=pd.NamedAgg(column='anything',aggfunc='sum'),\
                        second=pd.NamedAgg(column='something',aggfunc='max'))

#Output 
   one  second   
1    2       1     
2    4       2
3    6       3
4    4       4
5    5       5

Expected Output:

KeyError : "Column 'anything' does not exist!"

Or

TypeError : got an unknown keyword column

It works with any name given to column parameter in pd.NamedAgg I would expect it raise KeyError or TypeError.

MarcoGorelli · 2020-05-27T09:17:42Z

@gurukiran07 TBH it's not clear to me what the user should pass as column in NamedAgg when using SeriesGroupBy, as there wouldn't be any columns there anyway (though if I've misunderstood, please do let me know). Are you suggesting to disallow using NamedAgg with SeriesGroupBy?

gurukiran07 · 2020-05-27T10:21:51Z

@MarcoGorelli I was trying to convey the same you understood correctly. Disallowing NamedAgg or raise an error when a user passes column parameter while using with SeriesGroupBy.

Something like this

s.groupby(s.values).agg(pos = pd.NamedAgg(aggfunc=lambda x :x.sum()))

Currently it raises TypeError: __new__() missing 1 required positional argument: 'column'

TomAugspurger · 2020-05-27T12:02:47Z

We document the behavior for Series in https://pandas.pydata.org/docs/user_guide/groupby.html#named-aggregation

Named aggregation is also valid for Series groupby aggregations. In this case there’s no column selection, so the values are just the functions.

So for SeriesGroupBy.agg, if there are any tuples or NamedAgg present in kwargs then I think we should raise.

gurukiran07 · 2020-05-27T12:46:28Z

@TomAugspurger Thank you for the reply. Yes, raising an error would much better and consistent for SeriesGroupBy.agg. When there's no column for SeriesGroupBy but still NamedAgg accepting column as a parameter is looks glitchy and it works with whatever argument given.

if there are any tuples or NamedAgg present in kwargs then I think we should raise.
I totally agree with this.

Changing how agg of SeriesGroupBy(link SeriesGroupBy class) works can be helpful without disturbing DataFrameGroupBy.

TomAugspurger · 2020-05-27T12:56:23Z

NamedAgg is just a namedtuple. There’s nothing to be done there. The problem is solely in SerisGroupBy

MarcoGorelli · 2020-05-27T17:58:47Z

So for SeriesGroupBy.agg, if there are any tuples or NamedAgg present in kwargs then I think we should raise.

@gurukiran07 as this would be separate, if you're interested in working on it, could you open an new issue for it?

gurukiran07 · 2020-05-27T18:16:05Z

@TomAugspurger and @MarcoGorelli Yes I would be more than glad to work on it, I identified where the changes can be made so that an Error can be raised when kwargs are NamedAgg or tuples.

Adding some additional constraints in here can solve the issue groupby/generic.py.

I'll open a new issue and will start working on it and hopefully open a PR. Thank you.

TomAugspurger · 2020-06-10T15:58:18Z

I think this was fixed by #34435

jesrael added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels May 26, 2020

MarcoGorelli mentioned this issue May 27, 2020

BUG: aggregations were getting overwritten if they had the same name #30858

Merged

5 tasks

This was referenced May 28, 2020

BUG: SeriesGroupBy works with any column name specified in NamedAgg #34422

Closed

BUG: Fix to GH34422 SeriesGroupBy works only with 'func' now #34435

Merged

TomAugspurger closed this as completed Jun 10, 2020

TomAugspurger added this to the 1.1 milestone Jun 10, 2020

bashtage removed the Needs Triage Issue that has not been reviewed by a pandas team member label Aug 21, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: #34380

BUG: #34380

jesrael commented May 26, 2020 •

edited

Loading

gurukiran07 commented May 26, 2020 •

edited

Loading

INSTALLED VERSIONS

gurukiran07 commented May 26, 2020

gurukiran07 commented May 26, 2020 •

edited

Loading

MarcoGorelli commented May 26, 2020

gurukiran07 commented May 27, 2020

MarcoGorelli commented May 27, 2020

gurukiran07 commented May 27, 2020

MarcoGorelli commented May 27, 2020

gurukiran07 commented May 27, 2020

TomAugspurger commented May 27, 2020

gurukiran07 commented May 27, 2020

TomAugspurger commented May 27, 2020

MarcoGorelli commented May 27, 2020

gurukiran07 commented May 27, 2020 •

edited

Loading

TomAugspurger commented Jun 10, 2020

BUG: #34380

BUG: #34380

Comments

jesrael commented May 26, 2020 • edited Loading

From SO:

gurukiran07 commented May 26, 2020 • edited Loading

INSTALLED VERSIONS

gurukiran07 commented May 26, 2020

More weird result:

gurukiran07 commented May 26, 2020 • edited Loading

MarcoGorelli commented May 26, 2020

gurukiran07 commented May 27, 2020

MarcoGorelli commented May 27, 2020

gurukiran07 commented May 27, 2020

MarcoGorelli commented May 27, 2020

gurukiran07 commented May 27, 2020

TomAugspurger commented May 27, 2020

gurukiran07 commented May 27, 2020

TomAugspurger commented May 27, 2020

MarcoGorelli commented May 27, 2020

gurukiran07 commented May 27, 2020 • edited Loading

TomAugspurger commented Jun 10, 2020

jesrael commented May 26, 2020 •

edited

Loading

gurukiran07 commented May 26, 2020 •

edited

Loading

`More weird result:`

gurukiran07 commented May 26, 2020 •

edited

Loading

gurukiran07 commented May 27, 2020 •

edited

Loading