groupby/transform with NaNs in grouped column #9941

Closed
evanpw opened this issue Apr 19, 2015 · 2 comments · Fixed by #14907
Labels
API Design, Bug, Groupby, Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate)
Comments

@evanpw
Contributor

evanpw commented Apr 19, 2015

What's the expected behavior when grouping on a column containing NaN and then applying transform? For a Series, the current behavior is to raise an exception:

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({
...     'a': range(10),
...     'b': [1, 1, 2, 3, np.nan, 4, 4, 5, 5, 5]})
>>> 
>>> df.groupby(df.b)['a'].transform(max)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pandas/core/groupby.py", line 2422, in transform
    return self._transform_fast(cyfunc)
  File "pandas/core/groupby.py", line 2463, in _transform_fast
    return self._set_result_index_ordered(Series(values))
  File "pandas/core/groupby.py", line 498, in _set_result_index_ordered
    result.index = self.obj.index
  File "pandas/core/generic.py", line 1997, in __setattr__
    return object.__setattr__(self, name, value)
  File "pandas/src/properties.pyx", line 65, in pandas.lib.AxisProperty.__set__ (pandas/lib.c:41301)
    obj._set_axis(self.axis, value)
  File "pandas/core/series.py", line 273, in _set_axis
    self._data.set_axis(axis, labels)
  File "pandas/core/internals.py", line 2219, in set_axis
    'new values have %d elements' % (old_len, new_len))
ValueError: Length mismatch: Expected axis has 9 elements, new values have 10 elements

For a DataFrame, the missing value gets filled in with what looks like an uninitialized value from np.empty_like:

>>> df.groupby(df.b).transform(max)
   a
0  1
1  1
2  2
3  3
4 -1
5  6
6  6
7  9
8  9
9  9

It seems like either it should fill in the missing values with NaN (which might require a change of dtype), or just drop those rows from the result (which requires the shape to change). Either solution has the potential to surprise.
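The two options can be sketched by hand, without relying on how `transform` itself treats the NaN key. This is only an illustration: the names `group_max`, `filled`, and `dropped` are made up here, and it assumes `groupby` excludes NaN keys by default.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': range(10),
    'b': [1, 1, 2, 3, np.nan, 4, 4, 5, 5, 5]})

# Per-group maximum of 'a'; the NaN key is assumed to be excluded by groupby.
group_max = df.groupby('b')['a'].max()

# Option 1: keep the original shape. Looking each key up in group_max
# leaves NaN where the key is NaN, which forces a float dtype.
filled = df['b'].map(group_max)

# Option 2: drop the rows whose key is NaN. The shape changes,
# but the original integer dtype can be restored.
dropped = filled.dropna().astype(df['a'].dtype)

print(filled)
print(dropped)
```

Option 1 keeps the result aligned with `df.index` at the cost of a dtype change; option 2 keeps the dtype at the cost of a shorter result.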

@jreback
Contributor

jreback commented Apr 19, 2015

http://pandas.pydata.org/pandas-docs/stable/groupby.html#na-group-handling

This should work, so this is a bug: the NA group is not defined, so the resulting value for those rows should be NaN.
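As a quick check of that expectation, the sketch below assumes a pandas release that includes the fix linked to this issue (#14907, milestone 0.20.0); on older versions the same call may raise the ValueError reported above. The variable name `result` is illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': range(10),
    'b': [1, 1, 2, 3, np.nan, 4, 4, 5, 5, 5]})

# Assumption: this pandas version contains the fix. The row whose key is
# NaN belongs to no group, so its transformed value is expected to be NaN.
result = df.groupby('b')['a'].transform('max')

# The result should keep the original index, with only the NaN-keyed row missing.
print(result.isna().tolist())
```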

@jreback jreback added this to the Next Major Release milestone Apr 19, 2015
@jreback
Contributor

jreback commented Apr 19, 2015

xref #5456
xref #6992
xref #443

@jreback jreback added the Missing-data (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate), API Design, and Difficulty Intermediate labels Apr 19, 2015
mroeschke added a commit to mroeschke/pandas that referenced this issue Dec 18, 2016
@jreback jreback modified the milestones: 0.20.0, Next Major Release Dec 18, 2016