Skip to content

BUG: DataFrameGroupBy.transform and ngroup do not work with cumcount #27858

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Closed
Show file tree
Hide file tree
Changes from 11 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v0.25.1.rst
Original file line number Diff line number Diff line change
Expand Up @@ -121,7 +121,7 @@ Groupby/resample/rolling
- Bug in :meth:`pandas.core.groupby.DataFrameGroupBy.transform` where applying a timezone conversion lambda function would drop timezone information (:issue:`27496`)
- Bug in windowing over read-only arrays (:issue:`27766`)
- Fixed segfault in `pandas.core.groupby.DataFrameGroupBy.quantile` when an invalid quantile was passed (:issue:`27470`)
-
- Bug in :meth:`pandas.core.groupby.DataFrameGroupBy.transform` where ``cumcount`` and ``ngroup`` fail (:issue:`27472` and :issue:`27468`)

Reshaping
^^^^^^^^^
Expand Down
6 changes: 4 additions & 2 deletions pandas/core/groupby/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -100,7 +100,9 @@ def _gotitem(self, key, ndim, subset=None):

# cythonized transformations or canned "agg+broadcast", which do not
# require postprocessing of the result by transform.
cythonized_kernels = frozenset(["cumprod", "cumsum", "shift", "cummin", "cummax"])
cythonized_kernels = frozenset(
["cumprod", "cumsum", "shift", "cummin", "cummax", "cumcount"]
)

cython_cast_blacklist = frozenset(["rank", "count", "size", "idxmin", "idxmax"])

Expand All @@ -120,7 +122,6 @@ def _gotitem(self, key, ndim, subset=None):
"mean",
"median",
"min",
"ngroup",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

leave this one

"nth",
"nunique",
"prod",
Expand Down Expand Up @@ -158,6 +159,7 @@ def _gotitem(self, key, ndim, subset=None):
"rank",
"shift",
"tshift",
"ngroup"
]
)

Expand Down
3 changes: 2 additions & 1 deletion pandas/core/groupby/generic.py
Original file line number Diff line number Diff line change
Expand Up @@ -583,7 +583,8 @@ def transform(self, func, *args, **kwargs):
if not (func in base.transform_kernel_whitelist):
msg = "'{func}' is not a valid function name for transform(name)"
raise ValueError(msg.format(func=func))
if func in base.cythonized_kernels:
# transformation are added as well since they are broadcasted already
if func in base.cythonized_kernels or func in base.transformation_kernels:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of this can't you just add cumcount to the transformation list? This somewhat blurs the line between cythonized_kernels which I think describe performance ops and transformation_kernels which describe output shape

# cythonized transformation or canned "reduction+broadcast"
return getattr(self, func)(*args, **kwargs)
else:
Expand Down
47 changes: 45 additions & 2 deletions pandas/tests/groupby/test_transform.py
Original file line number Diff line number Diff line change
Expand Up @@ -1034,8 +1034,6 @@ def test_transform_agg_by_name(reduction_func, obj):
func = reduction_func
g = obj.groupby(np.repeat([0, 1], 3))

if func == "ngroup": # GH#27468
pytest.xfail("TODO: g.transform('ngroup') doesn't work")
if func == "size": # GH#27469
pytest.xfail("TODO: g.transform('size') doesn't work")

Expand Down Expand Up @@ -1074,3 +1072,48 @@ def test_transform_lambda_with_datetimetz():
name="time",
)
assert_series_equal(result, expected)


def test_transform_cumcount_ngroup():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this not already covered by test below? Ideally we can rely on fixtures rather than one-off tests like this

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's a bit different than below... the test below is just to ensure after transform, the index of transformed result is the same as the original dataset.

But you are right, this should not be in one-off tests... will look for if there is fixture for this already @WillAyd

df = DataFrame(dict(a=[0, 0, 0, 1, 1, 1], b=range(6)))
g = df.groupby(np.repeat([0, 1], 3))

# GH 27472
result = g.transform("cumcount")
expected = g.cumcount()
assert_series_equal(result, expected)

# GH 27468
result = g.transform("ngroup")
expected = g.ngroup()
assert_series_equal(result, expected)


@pytest.mark.parametrize(
"func",
[
"backfill",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there not already a fixture for these we can leverage?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, will change! thanks for review!

"bfill",
"cumcount",
"cummax",
"cummin",
"cumprod",
"cumsum",
"diff",
"ffill",
"pad",
"pct_change",
"rank",
"shift",
"ngroup",
],
)
def test_transformation_kernels_length(func):
# This test is to evaluate if after transformation, the index
# of transformed data is still the same with original DataFrame
# TODO: exceptions are fillna, tshfit and corrwith
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, these would be xfails in the param list

    pytest.param('fillna', marks=pytest.mark.xfail('reason'),

Can you do that here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, nice tip! thanks! Also i found out not all transform functions are tested, e.g. fillna or tshift etc, is it okay I open another PR to add tests for the untested function? (some might have bug) @TomAugspurger

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, that would be great.

df = DataFrame(dict(a=[0, 0, 0, 1, 1, 1], b=range(6)))
g = df.groupby(np.repeat([0, 1], 3))

result = g.transform(func)
assert (result.index == df.index).all()