groupby.transform inconsistent behavior when grouping by columns containing NaN #10923
Comments
I think we already have an issue for this. Can you check? I agree that this could have a nicer error message / better behavior.
Whoops, you're right. Sorry, I should have searched more thoroughly. This duplicates #9941.
I have a similar issue here. Let's consider the following data:
import numpy
import pandas
df = pandas.DataFrame({'A':numpy.random.rand(20),
                       'B':numpy.random.rand(20)*10,
                       'C':numpy.random.randint(0,5,20)})
df.loc[:4,'C']=None
Now, the two lines of code below are meant to do the same thing: output the group average as the new value of each row. The first one passes the string 'mean' to transform:
In [41]: df.groupby('C')['B'].transform('mean')
Out[41]:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 5.670891
6 5.335332
7 0.580197
8 5.670891
9 5.670891
10 1.628290
11 1.628290
12 5.670891
13 8.493416
14 5.670891
15 8.493416
16 5.335332
17 5.670891
18 5.670891
19 5.335332
Name: B, dtype: float64
In [42]: df.groupby('C')['B'].transform(lambda x:x.mean())
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-42-87c87c7a22f4> in <module>()
----> 1 df.groupby('C')['B'].transform(lambda x:x.mean())
~/.conda/envs/myroot/lib/python3.6/site-packages/pandas/core/groupby.py in transform(self, func, *args, **kwargs)
3061
3062 result.name = self._selected_obj.name
-> 3063 result.index = self._selected_obj.index
3064 return result
3065
~/.conda/envs/myroot/lib/python3.6/site-packages/pandas/core/generic.py in __setattr__(self, name, value)
3092 try:
3093 object.__getattribute__(self, name)
-> 3094 return object.__setattr__(self, name, value)
3095 except AttributeError:
3096 pass
pandas/_libs/src/properties.pyx in pandas._libs.lib.AxisProperty.__set__ (pandas/_libs/lib.c:45255)()
~/.conda/envs/myroot/lib/python3.6/site-packages/pandas/core/series.py in _set_axis(self, axis, labels, fastpath)
306 object.__setattr__(self, '_index', labels)
307 if not fastpath:
--> 308 self._data.set_axis(axis, labels)
309
310 def _set_subtyp(self, is_all_dates):
~/.conda/envs/myroot/lib/python3.6/site-packages/pandas/core/internals.py in set_axis(self, axis, new_labels)
2834 raise ValueError('Length mismatch: Expected axis has %d elements, '
2835 'new values have %d elements' %
-> 2836 (old_len, new_len))
2837
2838 self.axes[axis] = new_labels
ValueError: Length mismatch: Expected axis has 15 elements, new values have 20 elements
The first one, using the string 'mean', returns NaN for the rows whose group key is missing, while the second one, using a lambda, raises the ValueError above.
I first posted this question on SO: https://stackoverflow.com/questions/45333681/handling-na-in-groupby-transform . After some discussion there, I started to think that this is a bug. Thanks
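For anyone landing here, a minimal sketch of where the 15 vs. 20 mismatch appears to come from, assuming the default groupby behaviour of dropping rows whose key is NaN. The small frame and the names rows_in_groups, group_means and broadcast are made up for illustration; the map-based step is only one possible workaround, not what transform does internally.

```python
import numpy as np
import pandas as pd

# Small deterministic frame; the first two group keys are missing.
df = pd.DataFrame({
    "B": [1.0, 2.0, 3.0, 5.0, 7.0, 9.0],
    "C": [np.nan, np.nan, 0.0, 0.0, 1.0, 1.0],
})

# Rows with a NaN key belong to no group, so the groups cover 4 of the 6 rows.
rows_in_groups = int(df.groupby("C").size().sum())

# Workaround sketch: aggregate per group, then map the means back onto the keys.
# Rows with a missing key find no match and come out as NaN.
group_means = df.groupby("C")["B"].mean()
broadcast = df["C"].map(group_means)

print(rows_in_groups, len(df))
print(broadcast)
```

The result has one value per original row, with NaN wherever the key is missing, which matches what the 'mean' string version returns above.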
I suspect it has to do with the way in which we aggregate results at the end. Post this as a separate issue and reference this one.
This is similar to #9697, which was fixed in 0.16.1. I give a (very) slightly modified example here to show some related behavior which is at least inconsistent and should probably be handled cleanly.
It's not entirely clear to me what the desired behavior is in this case; it's possible that transform should not work here at all, since it spits out unexpected values. But at minimum it seems like it should do the same thing no matter how I invoke it below.
Example:
Error is similar to the one encountered in the previous issue: