BUG: Groupby transform with missing groups #8955


Closed
miketkelly opened this issue Dec 1, 2014 · 6 comments
Labels
Apply (Apply, Aggregate, Transform, Map) · Bug · Groupby

Comments

@miketkelly

In a groupby/transform when some of the groups are missing, should the transformed values be set to missing (my preference), left unchanged, or should this be an error? Currently the behavior is inconsistent between Series and Frames, and between cythonized and non-cythonized transformations.

For a Series with a non-cythonized transformation, the values are left unchanged:

>>> import pandas as pd
>>> import numpy as np

>>> s = pd.Series([100, 200, 300, 400])
>>> s.groupby([1, 1, np.nan, np.nan]).transform(pd.Series.mean)
0    200
1    200
2    300
3    400

For a Series with a cythonized function, it's an error (this changed between 0.14.1 and 0.15.0):

>>> s.groupby([1, 1, np.nan, np.nan]).transform(np.mean)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pandas/pandas/core/groupby.py", line 2425, in transform
    return self._transform_fast(cyfunc)
  File "pandas/pandas/core/groupby.py", line 2466, in _transform_fast
    return self._set_result_index_ordered(Series(values))
  File "pandas/pandas/core/groupby.py", line 494, in _set_result_index_ordered
    result.index = self.obj.index
  File "pandas/pandas/core/generic.py", line 1948, in __setattr__
    object.__setattr__(self, name, value)
  File "pandas/src/properties.pyx", line 65, in pandas.lib.AxisProperty.__set__ (pandas/lib.c:41020)
  File "pandas/pandas/core/series.py", line 262, in _set_axis
    self._data.set_axis(axis, labels)
  File "pandas/pandas/core/internals.py", line 2217, in set_axis
    'new values have %d elements' % (old_len, new_len))
ValueError: Length mismatch: Expected axis has 2 elements, new values have 4 elements

For DataFrames, the behavior is reversed: the cythonized function works, while the non-cythonized one raises:

>>> f = pd.DataFrame({'a': s, 'b': s * 2})
>>> f
     a    b
0  100  200
1  200  400
2  300  600
3  400  800
>>> f.groupby([1, 1, np.nan, np.nan]).transform(np.sum)
     a    b
0  300  600
1  300  600
2  300  600
3  400  800
>>> f.groupby([1, 1, np.nan, np.nan]).transform(pd.DataFrame.sum)
Traceback (most recent call last):
  File "pandas/pandas/core/groupby.py", line 3002, in transform
    return self._transform_general(func, *args, **kwargs)
  File "pandas/pandas/core/groupby.py", line 2968, in _transform_general
    return self._set_result_index_ordered(concatenated)
  File "pandas/pandas/core/groupby.py", line 494, in _set_result_index_ordered
    result.index = self.obj.index
  File "pandas/pandas/core/generic.py", line 1948, in __setattr__
    object.__setattr__(self, name, value)
  File "pandas/src/properties.pyx", line 65, in pandas.lib.AxisProperty.__set__ (pandas/lib.c:41020)
  File "pandas/pandas/core/generic.py", line 406, in _set_axis
    self._data.set_axis(axis, labels)
  File "pandas/pandas/core/internals.py", line 2217, in set_axis
    'new values have %d elements' % (old_len, new_len))
ValueError: Length mismatch: Expected axis has 2 elements, new values have 4 elements
>>> print(pd.__version__)
0.15.1-125-ge463818
@jreback

jreback commented Dec 2, 2014

Agree on the consistency.

I think pass-through grouping is probably the most intuitive; in other words, the values should be left unchanged by a transform.

Care to do a PR to fix?
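The pass-through semantics described here (rows with a missing key keep their original values) could be emulated like so. A sketch of the intended behavior, not the proposed implementation:

```python
import numpy as np
import pandas as pd

s = pd.Series([100, 200, 300, 400])
keys = pd.Series([1, 1, np.nan, np.nan])

# Broadcast each group's mean back to its rows; rows with a missing
# key come back as NaN here.
broadcast = keys.map(s.groupby(keys).mean())

# Fall back to the original values for rows whose key is missing:
# "the values should be unchanged by a transform".
out = broadcast.where(keys.notna(), s)
# [150.0, 150.0, 300.0, 400.0]
```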

jreback added this to the 0.16.0 milestone Dec 2, 2014
jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@kcarnold

This is still a bug (v0.20.3), though interestingly the Series now works correctly for np.mean (which crashed for the OP) and crashes for pd.Series.mean.

#9941 was a similar issue that got fixed recently.

@gfyoung

gfyoung commented Jul 20, 2017

@kcarnold : Interesting...is the stack-trace the same as before?

BTW, given what you said, we should probably add tests to confirm your first statement, though a PR to patch the still crashing behavior is still welcome!

@gfyoung

gfyoung commented Jul 20, 2017

agree on the consistency

@jreback : Well, you have your consistency now it seems. They both work with numpy, but they both fail with the pandas-defined aggregate functions. 😄

jbrockmendel added the Apply (Apply, Aggregate, Transform, Map) label Dec 1, 2019
@rhshadrach

rhshadrach commented Jan 10, 2021

None of the four ops raises now, but they give inconsistent results. Using np.mean and np.sum, you get four rows with NaNs filled in for the missing keys. Using pd...mean and pd...sum, there are two rows. From #35751 (cc @arw2019), I think the output here (where dropna=True) should have two rows.

I think the issue is here:

def _transform_fast(self, result) -> Series:
    """
    fast version of transform, only applicable to
    builtin/cythonizable functions
    """
    ids, _, ngroup = self.grouper.group_info
    result = result.reindex(self.grouper.result_index, copy=False)
    out = algorithms.take_1d(result._values, ids)
    return self.obj._constructor(out, index=self.obj.index, name=self.obj.name)

where it is incorrect to use self.obj.index.

The result when dropna=False is consistent and correct.
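The `dropna=False` behavior referred to here (available since pandas 1.1) treats NaN as its own group key, so every row receives a transformed value:

```python
import numpy as np
import pandas as pd

s = pd.Series([100, 200, 300, 400])

# With dropna=False, the two NaN keys form their own group, so
# rows 2 and 3 receive that group's mean (350.0) instead of NaN.
result = s.groupby([1, 1, np.nan, np.nan], dropna=False).transform("mean")
# 0    150.0
# 1    150.0
# 2    350.0
# 3    350.0
```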

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@rhshadrach

All four results are now consistent and correct, according to

https://pandas.pydata.org/pandas-docs/dev/whatsnew/v1.5.0.html#using-dropna-true-with-groupby-transforms

s = pd.Series([100, 200, 300, 400])
print(s.groupby([1, 1, np.nan, np.nan]).transform(pd.Series.mean))
# 0    150.0
# 1    150.0
# 2      NaN
# 3      NaN
# dtype: float64
print(s.groupby([1, 1, np.nan, np.nan]).transform(np.mean))
# 0    150.0
# 1    150.0
# 2      NaN
# 3      NaN
# dtype: float64

f = pd.DataFrame({'a': s, 'b': s * 2})
print(f.groupby([1, 1, np.nan, np.nan]).transform(np.sum))
#        a      b
# 0  300.0  600.0
# 1  300.0  600.0
# 2    NaN    NaN
# 3    NaN    NaN
print(f.groupby([1, 1, np.nan, np.nan]).transform(pd.DataFrame.sum))
#        a      b
# 0  300.0  600.0
# 1  300.0  600.0
# 2    NaN    NaN
# 3    NaN    NaN

Tests were added to pandas.tests.groupby.dropna as part of the fixes linked above.
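A minimal regression test in the spirit of those (a sketch, not a copy of the actual pandas test; assumes pandas >= 1.5, where the behavior above was fixed):

```python
import numpy as np
import pandas as pd

def test_series_transform_dropna_fills_nan():
    # With dropna=True (the default), rows whose group key is
    # missing should come back from transform as NaN.
    s = pd.Series([100, 200, 300, 400])
    result = s.groupby([1, 1, np.nan, np.nan]).transform("mean")
    expected = pd.Series([150.0, 150.0, np.nan, np.nan])
    pd.testing.assert_series_equal(result, expected)
```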
