Commit 1204372

cgangwar11 authored and Pingviinituutti committed
BUG : ValueError in case on NaN value in groupby columns (pandas-dev#24850)
1 parent 9f0f29d commit 1204372

3 files changed, +35 -0 lines changed

doc/source/whatsnew/v0.24.0.rst

@@ -1786,6 +1786,7 @@ Groupby/Resample/Rolling
 - Bug in :meth:`DataFrame.groupby` did not respect the ``observed`` argument when selecting a column and instead always used ``observed=False`` (:issue:`23970`)
 - Bug in :func:`pandas.core.groupby.SeriesGroupBy.pct_change` or :func:`pandas.core.groupby.DataFrameGroupBy.pct_change` would previously work across groups when calculating the percent change, where it now correctly works per group (:issue:`21200`, :issue:`21235`).
 - Bug preventing hash table creation with very large number (2^32) of rows (:issue:`22805`)
+- Bug in groupby when grouping on categorical causes ``ValueError`` and incorrect grouping if ``observed=True`` and ``nan`` is present in categorical column (:issue:`24740`, :issue:`21151`).

 Reshaping
 ^^^^^^^^^

pandas/core/groupby/grouper.py

@@ -299,6 +299,7 @@ def __init__(self, index, grouper=None, obj=None, name=None, level=None,
                 self._labels = self.grouper.codes
                 if observed:
                     codes = algorithms.unique1d(self.grouper.codes)
+                    codes = codes[codes != -1]
                 else:
                     codes = np.arange(len(categories))

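
The one-line change above relies on the fact that a pandas categorical stores missing values as the code -1, so the set of unique codes picks up -1 whenever a NaN is present. The following is a minimal pure-Python sketch of that idea, not the pandas implementation: `encode` and `unique_in_order` are hypothetical helpers standing in for the categorical codes and for `algorithms.unique1d`.

```python
# Toy model of categorical codes. `encode` and `unique_in_order` are
# hypothetical helpers for illustration only; they are not pandas API.
def encode(values, categories):
    """Map each value to its position in `categories`, or -1 if missing."""
    positions = {cat: i for i, cat in enumerate(categories)}
    return [positions.get(v, -1) for v in values]


def unique_in_order(codes):
    """Return the distinct codes in order of first appearance."""
    return list(dict.fromkeys(codes))


categories = ['a', 'b', 'd']
codes = encode(['a', None, 'a'], categories)

# Without the filter, the missing-value sentinel is treated as an observed code.
observed_before_fix = unique_in_order(codes)

# With the filter (the analogue of `codes = codes[codes != -1]`), it is dropped.
observed_after_fix = [c for c in observed_before_fix if c != -1]

print(observed_before_fix)
print(observed_after_fix)
```

In this toy model the unfiltered list contains the sentinel alongside the real code for 'a', while the filtered list contains only the code for 'a'. It illustrates why a sentinel must be removed before codes are interpreted as categories; it makes no claim about the exact failure path inside pandas.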

pandas/tests/groupby/test_categorical.py

@@ -420,6 +420,39 @@ def test_observed_groups(observed):
     tm.assert_dict_equal(result, expected)


+def test_observed_groups_with_nan(observed):
+    # GH 24740
+    df = pd.DataFrame({'cat': pd.Categorical(['a', np.nan, 'a'],
+                                             categories=['a', 'b', 'd']),
+                       'vals': [1, 2, 3]})
+    g = df.groupby('cat', observed=observed)
+    result = g.groups
+    if observed:
+        expected = {'a': Index([0, 2], dtype='int64')}
+    else:
+        expected = {'a': Index([0, 2], dtype='int64'),
+                    'b': Index([], dtype='int64'),
+                    'd': Index([], dtype='int64')}
+    tm.assert_dict_equal(result, expected)
+
+
+def test_dataframe_categorical_with_nan(observed):
+    # GH 21151
+    s1 = pd.Categorical([np.nan, 'a', np.nan, 'a'],
+                        categories=['a', 'b', 'c'])
+    s2 = pd.Series([1, 2, 3, 4])
+    df = pd.DataFrame({'s1': s1, 's2': s2})
+    result = df.groupby('s1', observed=observed).first().reset_index()
+    if observed:
+        expected = DataFrame({'s1': pd.Categorical(['a'],
+                              categories=['a', 'b', 'c']), 's2': [2]})
+    else:
+        expected = DataFrame({'s1': pd.Categorical(['a', 'b', 'c'],
+                                                   categories=['a', 'b', 'c']),
+                              's2': [2, np.nan, np.nan]})
+    tm.assert_frame_equal(result, expected)
+
+
 def test_datetime():
     # GH9049: ensure backward compatibility
     levels = pd.date_range('2014-01-01', periods=4)
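
For readers who want to see the behaviour the new tests cover without the pandas test harness, here is a small standalone sketch. It assumes a pandas version that includes this fix, reuses the data from `test_observed_groups_with_nan`, and swaps in `sum()` on the `vals` column (my choice for illustration, not something the commit's tests use) so the result is easy to inspect.

```python
import numpy as np
import pandas as pd

# Same data as test_observed_groups_with_nan: one NaN in the categorical
# column, and two declared categories ('b', 'd') that never occur.
df = pd.DataFrame({
    'cat': pd.Categorical(['a', np.nan, 'a'], categories=['a', 'b', 'd']),
    'vals': [1, 2, 3],
})

# observed=True: only categories that actually appear form groups;
# the row whose key is NaN belongs to no group.
observed_sum = df.groupby('cat', observed=True)['vals'].sum()

# observed=False: every declared category gets a row, even when empty.
all_sum = df.groupby('cat', observed=False)['vals'].sum()

print(observed_sum)
print(all_sum)
```

With `observed=True` the result should have a single row for 'a'; with `observed=False` it should also list 'b' and 'd' as empty groups.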
