`SubClassedDataFrame.groupby().mean()` etc. use method of `SubClassedDataFrame` #51765

AdamOrmondroyd · 2023-03-03T18:11:01Z

closes BUG: DataFrameSubClass.groupby() doesn't use methods of subclass #51757
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

I appreciate that changing the behaviour for all subclasses is overkill, so pointers on a better way to do this would be great.

jbrockmendel · 2023-03-03T18:25:29Z

pandas/core/groupby/generic.py

@@ -687,7 +689,7 @@ def value_counts(
            # in a backward compatible way
            # GH38672 relates to categorical dtype
            ser = self.apply(
-                Series.value_counts,
+                self._obj_1d_constructor.value_counts,


this is trouble bc in general we can't assume that _constructor is a class

What else could it be (practically speaking, I know it's Callable)?

Geopandas has a callable that can dispatch to different classes. @jorisvandenbossche has argued against deprecating allowing this.

I've added a check whether self._obj_1d_constructor is a Series
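To make the concern concrete, here is a minimal sketch in plain Python (no pandas import). `FakeSeries`, `FakeSubSeries`, `dispatching_constructor` and `unbound_value_counts` are hypothetical stand-ins; the function loosely imitates a constructor that is a callable choosing between classes, and its selection rule is invented. It only shows that looking up `.value_counts` works on a class but not on an arbitrary callable.

```python
class FakeSeries:
    """Hypothetical stand-in for pandas.Series."""

    def __init__(self, values):
        self.values = list(values)

    def value_counts(self):
        counts = {}
        for v in self.values:
            counts[v] = counts.get(v, 0) + 1
        return counts


class FakeSubSeries(FakeSeries):
    """Hypothetical subclass that a dispatching constructor may return."""


def dispatching_constructor(values):
    # A constructor that is a plain function, not a class: it picks the
    # result type from the data (the rule here is invented).
    cls = FakeSubSeries if len(values) > 2 else FakeSeries
    return cls(values)


def unbound_value_counts(constructor):
    # Looking up a method on the constructor only works if it is a class.
    return getattr(constructor, "value_counts", None)


class_lookup = unbound_value_counts(FakeSeries)
function_lookup = unbound_value_counts(dispatching_constructor)
```

With a function as the constructor there is no `value_counts` attribute to pass to `apply`, which is why the diff line above cannot be used unconditionally.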

pandas/tests/groupby/test_groupby_subclass.py

jorisvandenbossche · 2023-03-06T13:12:22Z

I appreciate that changing the behaviour for all subclasses is overkill

I think that this might indeed be an issue for subclasses that don't override those methods. If we want to keep a fix like this, maybe checking whether the method is the one from pandas or not might be better (something like self.obj.mean is Series.mean instead of type(self.obj) is Series).

Because with the current change, if you have a subclass that doesn't influence how aggregations are done, you would get a big slowdown because of the current PR I think?
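A minimal sketch of the "is the method still the stock one?" check, using hypothetical stand-in classes (`Base`, `Plain`, `Doubled`) rather than pandas objects. One detail worth noting is that the comparison has to be made on the class: a bound method is a fresh object on every attribute access, so comparing `obj.mean` with `Base.mean` by identity never succeeds.

```python
class Base:
    """Hypothetical stand-in for a pandas class such as Series."""

    def __init__(self, values):
        self.values = list(values)

    def mean(self):
        return sum(self.values) / len(self.values)


class Plain(Base):
    """Subclass that does not touch mean()."""


class Doubled(Base):
    """Subclass that overrides mean() (the doubling is arbitrary)."""

    def mean(self):
        return 2 * super().mean()


def overrides_mean(obj):
    # Compare the function stored on the class, not the bound method:
    # `obj.mean` creates a new bound-method object on every access, so
    # `obj.mean is Base.mean` is False even when nothing is overridden.
    return type(obj).mean is not Base.mean
```

Under this check a subclass like `Plain` would stay on the fast path, and only a subclass like `Doubled` would pay for the slower per-group call.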

jorisvandenbossche · 2023-03-06T13:17:31Z

pandas/core/groupby/generic.py

        if is_categorical_dtype(val.dtype) or (
            bins is not None and not np.iterable(bins)
        ):
            # scalar bins cannot be done at top level
            # in a backward compatible way
            # GH38672 relates to categorical dtype
            ser = self.apply(
-                Series.value_counts,
+                constructor_1d.value_counts,


Could this also be something like lambda group, **kwargs: group.value_counts(**kwargs)?

(didn't look in detail at how this code is working, so potentially this doesn't make sense at all)
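The difference between passing a class-level function and passing a lambda can be sketched without pandas. `Base`, `Tagged` and `apply_to_groups` are hypothetical stand-ins; `apply_to_groups` is a toy replacement for a groupby `apply`, not its real implementation.

```python
class Base:
    def __init__(self, values):
        self.values = list(values)

    def value_counts(self, normalize=False):
        counts = {}
        for v in self.values:
            counts[v] = counts.get(v, 0) + 1
        if normalize:
            total = len(self.values)
            counts = {k: c / total for k, c in counts.items()}
        return counts


class Tagged(Base):
    def value_counts(self, normalize=False):
        # Override: label the result so we can see which method ran.
        return {"tagged": super().value_counts(normalize=normalize)}


def apply_to_groups(groups, func, **kwargs):
    # Toy replacement for a groupby apply: call func on each group object.
    return [func(group, **kwargs) for group in groups]


groups = [Tagged([1, 1, 2]), Tagged([3])]

# Passing the base-class function bypasses the override.
via_class = apply_to_groups(groups, Base.value_counts)

# Calling the method on each group dispatches to the subclass.
via_lambda = apply_to_groups(groups, lambda group, **kw: group.value_counts(**kw))
```

The lambda form needs no knowledge of the constructor at all, so it would sidestep the question of whether the constructor is a class.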

pandas/core/groupby/groupby.py

AdamOrmondroyd · 2023-03-07T10:28:15Z

Is the WeightedDataFrame thing referenced in #51757 the motivating case here?

Yes, our classes add a set of "weights" to the index which are used to calculate statistics such as the mean.
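For illustration only, a small sketch of why a weighted and an unweighted mean disagree. The formula used is the standard weighted mean, sum(w*x) / sum(w); that this matches what the subclasses in #51757 compute is an assumption, and `weighted_mean` is a hypothetical helper.

```python
def weighted_mean(values, weights):
    # Standard weighted mean: sum(w * x) / sum(w).
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight


values = [1.0, 2.0, 4.0]
weights = [1.0, 1.0, 2.0]

unweighted = sum(values) / len(values)
weighted = weighted_mean(values, weights)
```

A groupby fast path that ignores the weights would return the first number per group where the subclass expects the second.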

What if instead of this we did something like
class DataFrame:
    @property
    def _gb_cls(self):
        return DataFrameGroupBy
and then had DataFrame.groupby return an instance of _gb_cls. Then subclasses could override that.

That would work, though I feel it would be unexpected extra work for developers, so it should be mentioned in the "Extending pandas" section of the docs.
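A runnable sketch of the `_gb_cls` hook, written with hypothetical stand-in classes (`Frame`, `FrameGroupBy`, `WeightedFrame`, `WeightedFrameGroupBy`) instead of the real pandas ones. It only shows the dispatch mechanism; the string return values are placeholders.

```python
class FrameGroupBy:
    """Hypothetical stand-in for DataFrameGroupBy."""

    def __init__(self, obj, key):
        self.obj = obj
        self.key = key

    def mean(self):
        return "base groupby mean"


class Frame:
    """Hypothetical stand-in for DataFrame."""

    @property
    def _gb_cls(self):
        return FrameGroupBy

    def groupby(self, key):
        # Build whatever groupby class the (sub)class advertises.
        return self._gb_cls(self, key)


class WeightedFrameGroupBy(FrameGroupBy):
    def mean(self):
        return "weighted groupby mean"


class WeightedFrame(Frame):
    @property
    def _gb_cls(self):
        return WeightedFrameGroupBy
```

The extra work for subclass authors is visible here: they have to provide both the frame subclass and a matching groupby subclass.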

I was very surprised that groupby() doesn't just use the methods of its underlying class. If DataFrame.mean() and Series.mean() are themselves optimised, what is gained by GroupBy reimplementing them?

jbrockmendel · 2023-03-07T17:17:54Z

If DataFrame.mean() and Series.mean() are themselves optimised, what is gained by GroupBy reimplementing them?

Performance. The paths that iterate over groups generally make a sorted copy of the DataFrame to make that iteration faster, but the iteration itself is still costly.
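A plain-Python analogy of the two shapes being contrasted. The first function follows the iterate-over-groups pattern described above (sort, then visit each group and call a per-group mean); the second is only a stand-in for a single-pass aggregation, since the actual pandas fast path is Cython and is not reproduced here. Both function names are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def mean_by_iterating(rows):
    # Sort once by key, then walk the groups and call a per-group mean.
    ordered = sorted(rows, key=itemgetter(0))
    result = {}
    for key, group in groupby(ordered, key=itemgetter(0)):
        values = [value for _, value in group]
        result[key] = sum(values) / len(values)
    return result


def mean_in_one_pass(rows):
    # Keep a running sum and count per key; no sorting, no per-group call.
    sums, counts = {}, {}
    for key, value in rows:
        sums[key] = sums.get(key, 0.0) + value
        counts[key] = counts.get(key, 0) + 1
    return {key: sums[key] / counts[key] for key in sums}


rows = [("a", 1.0), ("b", 10.0), ("a", 3.0), ("b", 20.0)]
```

The single-pass version never calls a `mean` method on a group object, so it has no place where a subclass override could take effect.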

AdamOrmondroyd · 2023-03-10T14:43:58Z

If DataFrame.mean() and Series.mean() are themselves optimised, what is gained by GroupBy reimplementing them?

Performance. The paths that iterate over groups generally make a sorted copy of the DataFrame to make that iteration faster, but the iteration itself is still costly.

but, if these behaviours only change for subclasses, then performance is only lost for subclasses which override mean etc, in which case surely it is preferable that the intended results are returned rather than a fast one?

jbrockmendel · 2023-03-10T18:38:11Z

but, if these behaviours only change for subclasses, then performance is only lost for subclasses which override mean etc, in which case surely it is preferable that the intended results are returned rather than a fast one?

The question was why we use a cython path instead of iterating over DataFrame.mean. I gave an explanation of the status quo, not an argument against the fix suggested here.

AdamOrmondroyd · 2023-03-13T14:43:58Z

but, if these behaviours only change for subclasses, then performance is only lost for subclasses which override mean etc, in which case surely it is preferable that the intended results are returned rather than a fast one?

The question was why we use a cython path instead of iterating over DataFrame.mean. I gave an explanation of the status quo, not an argument against the fix suggested here.

Apologies, I misunderstood the difference you were highlighting.

How do you think I should proceed?

jbrockmendel · 2023-03-14T20:16:04Z

pandas/core/groupby/generic.py

@@ -681,14 +683,20 @@ def value_counts(

        index_names = self.grouper.names + [self.obj.name]

+        constructor_1d = (
+            self._obj_1d_constructor
+            if isinstance(self._obj_1d_constructor, Series)


i think issubclass rather than isinstance?
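A short sketch of the distinction, with hypothetical stand-ins (`Base` for `Series`, `Sub` for a subclass, `make_sub` for a constructor that is not a class). `isinstance` asks whether an object is an instance of the class, so it is False for the class itself; `issubclass` is the right question but raises `TypeError` for non-classes, hence the guard.

```python
class Base:
    """Hypothetical stand-in for Series."""


class Sub(Base):
    """Hypothetical subclass."""


def make_sub():
    # A callable constructor that is not a class.
    return Sub()


def is_base_subclass(constructor):
    # issubclass() raises TypeError for non-classes, so check for a
    # class first.
    return isinstance(constructor, type) and issubclass(constructor, Base)
```

With this helper, `is_base_subclass(Sub)` is True while `isinstance(Sub, Base)` is False, and a function such as `make_sub` is rejected without raising.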

jbrockmendel · 2023-03-14T20:16:50Z

pandas/core/groupby/generic.py

@@ -1290,9 +1298,9 @@ def aggregate(self, func=None, *args, engine=None, engine_kwargs=None, **kwargs)
        elif relabeling:
            # this should be the only (non-raising) case with relabeling
            # used reordered index of columns
-            result = cast(DataFrame, result)
+            result = cast(self.obj._constructor, result)


these will be wrong if _constructor is not a class
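One way to read this comment: `typing.cast` does nothing at runtime and exists only for static type checkers, which expect a type as the first argument. The sketch below uses hypothetical stand-ins (`Base`, `make_base`) and shows only the runtime half, namely that `cast` returns its second argument unchanged whatever is passed first.

```python
from typing import cast


class Base:
    """Hypothetical stand-in for DataFrame."""


def make_base():
    # Callable constructor that is not a class.
    return Base()


value = {"not": "a Base"}

# cast() only informs the type checker; at runtime it returns its second
# argument unchanged, whatever the first argument is.
as_class = cast(Base, value)
as_callable = cast(make_base, value)
```

Since nothing is converted or checked at runtime, the only effect of the changed line is on type checking, and a non-class constructor is not a meaningful type there.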

jbrockmendel · 2023-03-14T20:18:13Z

How do you think I should proceed?

Unless there is a viable way to handle your use case with an ExtensionArray (i.e. the weights would be somehow included with the values, not at the DataFrame level), then the approach here seems fine (albeit ugly)

jbrockmendel · 2023-04-21T15:35:17Z

doc/source/whatsnew/v2.1.0.rst

@@ -223,6 +223,10 @@ Plotting
 - Bug in :meth:`Series.plot` when invoked with ``color=None`` (:issue:`51953`)
 -

+Groupby
+- Bug in :meth:`GroupBy.mean`, :meth:`GroupBy.median`, :meth:`GroupBy.std`, :meth:`GroupBy.var`, :meth:`GroupBy.sem`, :meth:`GroupBy.prod`, :meth:`GroupBy.min`, :meth:`GroupBy.max` don't use corresponding methods of subclasses of :class:`Series` or :class:`DataFrame` (:issue:`51757`)


i think this belongs below in the Groupby/resample/rolling section?

jbrockmendel · 2023-04-21T15:36:23Z

pandas/core/groupby/groupby.py

+        if not (
+            type(self.obj).median is Series.median
+            or type(self.obj).median is DataFrame.median
+        ):


if this pattern is going to show up a lot, does it merit a decorator?

I've had a first pass at making a decorator, not sure how to deal with engine and engine_kwargs
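A rough sketch of what such a decorator could look like, under loud assumptions: `use_subclass_method`, `ToyGroupBy`, `Base` and `Doubled` are hypothetical names, the classes are plain-Python stand-ins for the pandas ones, and `engine` / `engine_kwargs` handling is deliberately left out because the thread does not settle it. This is not the decorator from the PR.

```python
from functools import wraps


class Base:
    def __init__(self, values):
        self.values = list(values)

    def mean(self):
        return sum(self.values) / len(self.values)


class Doubled(Base):
    def mean(self):
        return 2 * super().mean()


def use_subclass_method(fast_method):
    """Hypothetical decorator: defer to an overridden method per group."""
    name = fast_method.__name__

    @wraps(fast_method)
    def wrapper(self, *args, **kwargs):
        if getattr(type(self.obj), name) is getattr(Base, name):
            # Not overridden: keep the fast path.
            return fast_method(self, *args, **kwargs)
        # Overridden: call the subclass method on every group.
        return {
            key: getattr(group, name)(*args, **kwargs)
            for key, group in self.groups().items()
        }

    return wrapper


class ToyGroupBy:
    def __init__(self, obj, keys):
        self.obj = obj
        self.keys = list(keys)

    def groups(self):
        # Split the values by key, rebuilding objects of the same type.
        buckets = {}
        for key, value in zip(self.keys, self.obj.values):
            buckets.setdefault(key, []).append(value)
        return {key: type(self.obj)(vals) for key, vals in buckets.items()}

    @use_subclass_method
    def mean(self):
        # Stand-in for the fast path: one pass, base-class semantics.
        sums, counts = {}, {}
        for key, value in zip(self.keys, self.obj.values):
            sums[key] = sums.get(key, 0.0) + value
            counts[key] = counts.get(key, 0) + 1
        return {key: sums[key] / counts[key] for key in sums}
```

Deriving the method name from the wrapped function keeps the repeated `type(self.obj).<name> is ...` test in one place, and `@wraps` preserves the name and docstring of the decorated method.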

mroeschke · 2023-08-11T00:59:01Z

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

AdamOrmondroyd added 11 commits March 2, 2023 15:54

a whole bunch of 1d constructors

df9b39a

got the first three of Lukas' assertions working

de8c231

remove print statements

d057cd0

Merge branch 'main' into groupby

e56b488

create test for overridden methods

5338d3f

also attack median, std, var, sem, prod, sum, min and max

5fe7f82

Merge branch 'main' into groupby

8f32b12

add tests for test of methods

4aa2b85

change 1d constructors to constructors

8d7346d

tidy up

37ae233

change to np.all

aa57cc2

jbrockmendel reviewed Mar 3, 2023

lukashergt reviewed Mar 3, 2023

AdamOrmondroyd and others added 4 commits March 3, 2023 19:17

remove deliberate test failure

b3df075

check that self._obj_1d_constructor is Series

adc132a

add entry to docs

31868ff

Merge branch 'main' into groupby

c0b2ad7

jbrockmendel mentioned this pull request Mar 3, 2023

DEPR: require _constructor/_constructor_sliced to return a class #51772

Open

Merge branch 'main' into groupby

98b7986

jorisvandenbossche reviewed Mar 6, 2023

AdamOrmondroyd changed the title ~~SubClassedDataFrame.groupby.mean() etc. use method of SubClassedDataFrame~~ SubClassedDataFrame.groupby().mean() etc. use method of SubClassedDataFrame Mar 6, 2023

AdamOrmondroyd and others added 4 commits March 6, 2023 13:43

Merge branch 'main' into groupby

053c865

check for equality of mean methods

2efa052

repeat for other methods

1505a1c

also test Series

af9ac26

AdamOrmondroyd commented Mar 6, 2023

AdamOrmondroyd added 3 commits March 6, 2023 16:48

pass through numeric_only

f4bc548

reinstate type hinting

12a9fa8

add type() to method comparison

f46eea9

AdamOrmondroyd and others added 3 commits March 7, 2023 10:49

test transform

185a3c1

correct _constructor

bf9bde6

Merge branch 'main' into groupby

036d662

Merge branch 'main' into groupby

f348648

jbrockmendel reviewed Mar 14, 2023

AdamOrmondroyd and others added 2 commits March 27, 2023 12:32

Merge branch 'main' into groupby

48ceb0a

remove unnecessary(?) if statement

a6be1ea

jbrockmendel reviewed Apr 21, 2023

AdamOrmondroyd added 9 commits April 24, 2023 12:28

Merge branch 'main' into groupby

f0ed14a

first pass at decorator

6631b1e

add decorator to other methods

94dc186

missed max()

310c339

add @wraps

27c4ed9

Merge branch 'main' into groupby

9bafd9a

Merge branch 'main' into groupby

963b3fe

add tests for series example

8a9f30f

Merge branch 'main' into groupby

518d42e

mroeschke added the Groupby and Subclassing (Subclassing pandas objects) labels May 18, 2023

Merge branch 'main' into groupby

6320057

mroeschke closed this Aug 11, 2023
