BUG: Fix remaining cases of groupby(...).transform with dropna=True #46367

rhshadrach · 2022-03-15T03:27:34Z

Closes #17093 (Ambiguous behaviour when transform groupby with NaNs)
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

rhshadrach · 2022-03-15T03:29:41Z

pandas/core/groupby/ops.py

@@ -969,7 +985,7 @@ def _cython_operation(
        """
        assert kind in ["transform", "aggregate"]

-        cy_op = WrappedCythonOp(kind=kind, how=how)
+        cy_op = WrappedCythonOp(grouper=self, kind=kind, how=how)


@jbrockmendel - It would suffice to pass self.has_dropped_na instead of the full grouper here; I wasn't sure if there was a preference one way or the other.

jbrockmendel (Mar 15, 2022): I'd much rather has_dropped_na just be part of WrappedCythonOp's state.
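To make the trade-off in this thread concrete, here is a minimal sketch of the "flag as state" option, assuming nothing about the real pandas internals. `CythonOpSketch` and `GrouperSketch` are hypothetical stand-ins, not the actual `WrappedCythonOp` or grouper classes; the point is only that the op stores a single boolean instead of a reference to the whole grouper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CythonOpSketch:
    """Hypothetical stand-in for WrappedCythonOp; not the pandas class."""

    kind: str
    how: str
    has_dropped_na: bool  # the only piece of grouper state the op needs

    def needs_na_capable_output(self) -> bool:
        # Rows dropped from every group must come back as NA in a transform.
        return self.kind == "transform" and self.has_dropped_na


class GrouperSketch:
    """Hypothetical grouper; a code of -1 marks a row whose key was dropped."""

    def __init__(self, codes):
        self.codes = list(codes)

    @property
    def has_dropped_na(self) -> bool:
        return any(code < 0 for code in self.codes)

    def make_op(self, kind: str, how: str) -> CythonOpSketch:
        # Pass the flag, not `self`, so the op does not depend on the grouper.
        return CythonOpSketch(kind=kind, how=how, has_dropped_na=self.has_dropped_na)


op = GrouperSketch([0, 0, -1]).make_op("transform", "cumsum")
print(op.needs_na_capable_output())
```

Keeping the flag on the op means the op can be built and tested without constructing a grouper at all.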

jbrockmendel · 2022-03-15T19:07:02Z

pandas/core/groupby/ops.py

@@ -592,6 +604,10 @@ def _call_cython_op(

        result = result.T

+        if self.how == "rank" and self.grouper.has_dropped_na:
+            # TODO: Wouldn't need this if group_rank supported mask


Would this be a matter of supporting a mask arg in the libgroupby function, or would this be just for Nullable dtypes? I ask because I have a branch on deck that does the former.

rhshadrach (Mar 18, 2022): The former. I believe rank had to be singled out here because it was the one transform that didn't support a mask arg.

jbrockmendel (May 4, 2022): @rhshadrach mask just got added to group_rank, but commenting this out causes a few failures. Is more needed?

rhshadrach (May 6, 2022): This comment was incorrect; opened #46953.

jbrockmendel · 2022-03-15T19:08:11Z

pandas/core/groupby/ops.py

@@ -462,6 +469,11 @@ def _cython_op_ndim_compat(
            # otherwise we have OHLC
            return res.T

+        if self.kind == "transform" and self.grouper.has_dropped_na:
+            mask = comp_ids == -1
+            # make mask 2d


Can this be rolled into the mask-reshaping done above?

rhshadrach (Mar 16, 2022): This block turned out not to be necessary. It will be removed.

jbrockmendel · 2022-03-15T19:10:20Z

pandas/core/groupby/ops.py

@@ -260,6 +264,9 @@ def _get_output_shape(self, ngroups: int, values: np.ndarray) -> Shape:
    def _get_out_dtype(self, dtype: np.dtype) -> np.dtype:
        how = self.how

+        if self.kind == "transform" and self.grouper.has_dropped_na:
+            dtype = ensure_dtype_can_hold_na(dtype)
+


ATM we do something similar to this at L578-L591. Could this be rolled into that? (Admittedly that's a bit messy, so this may just be A Better Solution.)

rhshadrach (Mar 16, 2022): It seems better to me to determine the result dtype upfront, rather than cast after, as much as possible.

jbrockmendel (Mar 16, 2022): I don't have an opinion on before vs. after, so I'm happy to defer to you, but I do strongly prefer doing things Just One Way.

rhshadrach (Mar 18, 2022): Makes sense. I didn't realize that my change to _get_cython_vals actually makes values float, so this is indeed not necessary at all.
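For readers following the "decide the dtype upfront" argument, the idea can be sketched with plain NumPy. This is an illustration only: `na_capable_dtype` is a hypothetical, simplified helper (not pandas' `ensure_dtype_can_hold_na`), and the `comp_ids` convention of `-1` for a dropped row is taken from the diff context in this review.

```python
import numpy as np


def na_capable_dtype(dtype: np.dtype) -> np.dtype:
    """Hypothetical helper: pick a NumPy dtype that can store NaN."""
    if dtype.kind == "b":
        return np.dtype(object)  # booleans have no NaN representation
    if dtype.kind in "iu":
        return np.dtype(np.float64)  # integers are widened to float
    return dtype  # floats, objects, etc. can already hold NaN


values = np.array([2, 3, 4], dtype=np.int64)
comp_ids = np.array([0, 0, -1])  # -1 marks a row whose group key was dropped
dropped = comp_ids == -1

# Decide the output dtype before computing, instead of casting afterwards.
out_dtype = na_capable_dtype(values.dtype) if dropped.any() else values.dtype
out = values.astype(out_dtype)  # identity "transform", just for illustration
out[dropped] = np.nan
print(out.dtype, out)
```

As the thread concludes, the PR did not end up needing a separate step like this, because the values were already float by the time the dtype was computed.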

jbrockmendel · 2022-03-15T19:12:04Z

Last thought before I go guzzle some caffeine: this looks tangentially related to #43943. Might be something worth salvaging from that.

jreback · 2022-03-16T00:48:08Z

doc/source/whatsnew/v1.5.0.rst

    In [3]: df.groupby('a', dropna=True).transform(lambda x: x.sum())
    Out[3]:
       b
    0  5
    1  5

+    In [3]: df.groupby('a', dropna=True).transform('ffill')


Add a comment here on the casting of the fill value.

rhshadrach (Mar 16, 2022): I think you're saying to explain why the old result comes out as -9223372036854775808, which is what np.nan becomes when cast to an integer (int64). Will do.
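As a quick sanity check on that number, the snippet below (standard library only) confirms that -9223372036854775808 is the most negative value a signed 64-bit integer can hold. On many platforms that is what a float NaN turns into when cast to int64, although the result of such a cast is not guaranteed in general; the variable names here are illustrative only.

```python
import struct

sentinel = -9223372036854775808
INT64_MIN = -(2**63)

# The value from the old output is exactly the smallest signed 64-bit integer.
is_int64_min = sentinel == INT64_MIN

# Its two's-complement bit pattern has only the sign bit set.
bit_pattern = struct.pack(">q", sentinel).hex()
print(is_int64_min, bit_pattern)
```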

jbrockmendel · 2022-03-16T05:09:10Z

pandas/core/groupby/groupby.py

-            return self._python_apply_general(curried, self._obj_with_exclusions)
+            result = self._python_apply_general(curried, self._obj_with_exclusions)
+
+            if result.ndim == 1 and self.obj.ndim == 1 and result.name != self.obj.name:


result.name != self.obj.name will be wrong with np.nan (and will raise with pd.NA)

rhshadrach (Mar 18, 2022): Doh, thanks. This highlights the fact that we shouldn't be fixing this here, but rather in the logic that wraps the apply results. I didn't want this PR spreading out to that method, though. I've opened #46369 for this; I'm thinking that here we skip testing the name and fix it properly in that issue.
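The pitfall raised in this thread is easy to reproduce without pandas, since np.nan is an ordinary Python float. The sketch below shows why `!=` reports a mismatch for two NaN names; `names_equal` is a hypothetical helper written for illustration, not a pandas function. The pd.NA case (where the truth value of the comparison is ambiguous) is not covered here.

```python
import math


def names_equal(left, right) -> bool:
    """Hypothetical helper: equality that treats two NaN names as the same."""
    both_nan = (
        isinstance(left, float)
        and isinstance(right, float)
        and math.isnan(left)
        and math.isnan(right)
    )
    return both_nan or left == right


nan_name = float("nan")

# NaN never compares equal to itself, so the naive check says the names differ.
naive_says_different = nan_name != nan_name
print(naive_says_different, names_equal(nan_name, nan_name))
```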

rhshadrach · 2022-03-18T21:13:34Z

Thanks for all the feedback @jbrockmendel and @jreback - ready for another review.

jreback

Looks good. Please rebase; some small comments. Ping on green.

jreback · 2022-03-19T18:08:05Z

doc/source/whatsnew/v1.5.0.rst

@@ -83,32 +83,41 @@ did not have the same index as the input.

 .. code-block:: ipython

+    In [3]: df.groupby('a', dropna=True).transform('sum')


It may be worth commenting on each of these to say what is changing (e.g. like you did for [3]).
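For context, the behavior this whatsnew entry documents can be tried directly. The sketch below assumes pandas 1.5 or later, and the frame is an assumption chosen to match the outputs quoted in this review (two rows with key 1 and one row with a null key); it is not copied from the PR.

```python
import numpy as np
import pandas as pd

# Assumed input: the last row has a null group key.
df = pd.DataFrame({"a": [1, 1, np.nan], "b": [2, 3, 4]})
gb = df.groupby("a", dropna=True)

# With the fix, the row whose key was dropped is kept in the result
# and filled with NaN, so the output has the same index as the input.
summed = gb.transform("sum")
filled = gb.transform("ffill")
print(summed)
print(filled)
```

On older pandas versions the same calls may return the integer results shown in the "old behavior" part of the whatsnew entry instead.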

jreback · 2022-03-19T18:09:23Z

pandas/tests/groupby/transform/test_transform.py

@@ -394,10 +407,14 @@ def test_transform_transformation_func(request, transformation_func):
        mock_op = lambda x: getattr(x, transformation_func)()

    result = test_op(df.groupby("A"))
-    groups = [df[["B"]].iloc[:4], df[["B"]].iloc[4:6], df[["B"]].iloc[6:]]
-    expected = concat([mock_op(g) for g in groups])
+    # pass the group in same order as iterating `for ... in df.groupby(...)`


Can you link the issue/PR here?

rhshadrach (Mar 19, 2022): Within this test, the issue number is on L374.
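The code comment under review is about building the expected result from groups in the same order that iterating a groupby yields them. A small sketch of that pattern follows, assuming pandas is available; the frame and the choice of `cumsum` are illustrative and not taken from the test file.

```python
import pandas as pd

df = pd.DataFrame({"A": ["b", "a", "b", "a"], "B": [1, 2, 3, 4]})
gb = df.groupby("A")

# Iterating a groupby yields (key, group) pairs; with the default sort=True
# the keys come out in sorted order, not in order of first appearance.
keys = [key for key, _ in gb]
groups = [group[["B"]] for _, group in gb]

# Apply the operation group by group, then restore the original row order
# so the result can be compared with the built-in transform.
expected = pd.concat([g.cumsum() for g in groups]).sort_index()
result = gb.cumsum()
print(keys)
print(result.equals(expected))
```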

jreback · 2022-03-22T23:25:05Z

thanks @rhshadrach

rhshadrach added 17 commits February 5, 2022 13:59

BUG: Correct results for groupby(...).transform with null keys

2143c59

Merge branch 'main' of https://github.com/pandas-dev/pandas into transform_dropna

deb3edb

Series tests

04a2a6b

Remove reliance on Series

ad7bf3e

Merge branch 'main' of https://github.com/pandas-dev/pandas into transform_dropna

39e1438

Merge branch 'main' of https://github.com/pandas-dev/pandas into transform_dropna
Conflicts: doc/source/whatsnew/v1.5.0.rst

7a4a99b

Merge branch 'main' of https://github.com/pandas-dev/pandas into transform_dropna
Conflicts: pandas/core/groupby/generic.py, pandas/core/groupby/groupby.py, pandas/core/groupby/ops.py, pandas/tests/groupby/transform/test_transform.py

f234a89

Merge cleanup - WIP

d1eeb65

Merge branch 'transform_dropna' of https://github.com/rhshadrach/pandas into transform_dropna

e57025d

Fixes for ngroup

be2f45c

Merge branch 'main' of https://github.com/pandas-dev/pandas into transform_dropna

8e5df01

name in some cases

2d79a0f

Merge branch 'main' of https://github.com/pandas-dev/pandas into transform_dropna
Conflicts: pandas/core/groupby/groupby.py, pandas/core/groupby/ops.py, pandas/tests/groupby/transform/test_transform.py

e2f3080

Fixups for ngroup

d9da864

Fixups

b973af0

Fixups

67ffb60

Fixups

e31041d

rhshadrach requested a review from jbrockmendel March 15, 2022 03:28

jbrockmendel reviewed Mar 15, 2022

jreback added the Groupby label Mar 16, 2022

jreback added this to the 1.5 milestone Mar 16, 2022

jreback requested changes Mar 16, 2022

jbrockmendel reviewed Mar 16, 2022

rhshadrach added 2 commits March 18, 2022 16:29

Merge branch 'main' of https://github.com/pandas-dev/pandas into transform_dropna
Conflicts: pandas/core/groupby/ops.py

60e60b0

PR feedback

04293a7

jreback requested changes Mar 19, 2022

rhshadrach added 5 commits March 19, 2022 15:37

whatsnew improvements

9a6dc3c

Merge branch 'main' of https://github.com/pandas-dev/pandas into transform_dropna
Conflicts: pandas/tests/groupby/transform/test_transform.py

d7bb828

merge cleanup

6571291

type-hint fixup

a8f524b

Merge branch 'main' of https://github.com/pandas-dev/pandas into transform_dropna

2a02de0

jreback approved these changes Mar 22, 2022

jreback merged commit 8714948 into pandas-dev:main Mar 22, 2022

rhshadrach deleted the transform_dropna branch April 10, 2022 22:23

yehoshuadimarsky pushed a commit to yehoshuadimarsky/pandas that referenced this pull request Jul 13, 2022

BUG: Fix remaining cases of groupby(...).transform with dropna=True (pandas-dev#46367)

406ab9f

rhshadrach mentioned this pull request Dec 8, 2022

BUG: GroupBy.ngroup dropna=False inconsistency when using Categorical #50100

Closed

3 tasks

rhshadrach added the Missing-data label (np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate) Jul 15, 2023

rhshadrach mentioned this pull request Jul 15, 2023

Inconsistent result with cumsum columns #32462

Closed
