Skip to content

SubClassedDataFrame.groupby().mean() etc. use method of SubClassedDataFrame #51765

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 39 commits into from

Conversation

AdamOrmondroyd
Copy link
Contributor

@AdamOrmondroyd AdamOrmondroyd commented Mar 3, 2023

I appreciate that changing the behaviour for all subclasses is overkill, so pointers on how to better do this would be great

@@ -687,7 +689,7 @@ def value_counts(
# in a backward compatible way
# GH38672 relates to categorical dtype
ser = self.apply(
Series.value_counts,
self._obj_1d_constructor.value_counts,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is trouble bc in general we can't assume that _constructor is a class

Copy link
Contributor Author

@AdamOrmondroyd AdamOrmondroyd Mar 3, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What else could it be (practically speaking, I know it's Callable)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Geopandas has a callable that can dispatch to different classes. @jorisvandenbossche has argued against deprecating allowing this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a check whether self._obj_1d_constructor is a Series

@jorisvandenbossche
Copy link
Member

I appreciate that changing the behaviour for all subclasses is overkill

I think that this might indeed be an issue for subclasses that don't override those methods. If we want to keep a fix like this, maybe checking whether the method is the one from pandas or not might be better (something like self.obj.mean is Series.mean instead of type(self.obj) is Series).

Because with the current change, if you have a subclass that doesn't influence how aggregations are done, you would get a big slowdown because of the current PR I think?

if is_categorical_dtype(val.dtype) or (
bins is not None and not np.iterable(bins)
):
# scalar bins cannot be done at top level
# in a backward compatible way
# GH38672 relates to categorical dtype
ser = self.apply(
Series.value_counts,
constructor_1d.value_counts,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this also be something like lambda group, **kwargs: group.value_counts(**kwargs)?

(didn't look in detail at how this code is working, so potentially this doesn't make sense at all)

@AdamOrmondroyd AdamOrmondroyd changed the title SubClassedDataFrame.groupby.mean() etc. use method of SubClassedDataFrame SubClassedDataFrame.groupby().mean() etc. use method of SubClassedDataFrame Mar 6, 2023
@AdamOrmondroyd
Copy link
Contributor Author

AdamOrmondroyd commented Mar 7, 2023

Is the WeightedDataFrame thing referenced in #51757 the motivating case here?

Yes, our classes add a set of "weights" to the index which are used to calculate statistics such as the mean.

What if instead of this we did did something like

class DataFrame:
    @property
    def _gb_cls(self):
        return DataFrameGroupBy

and then had DataFrame.groupby return an instance of _gb_cls. Then subclasses could override that.

That would function, though feel this would be unexpected extra work for developers, so should be mentioned in the extending pandas section of the docs.

I was very surprised that groupby() doesn't just use the methods of its underlying class. If DataFrame.mean() and Series.mean() are themselves optimised, what is gained by GroupBy reimplementing them?

@jbrockmendel
Copy link
Member

If DataFrame.mean() and Series.mean() are themselves optimised, what is gained by GroupBy reimplementing them?

Performance. The paths that iterate over groups generally make a sorted copy of the DataFrame to make that iteration faster, but the iteration itself is still costly.

@AdamOrmondroyd
Copy link
Contributor Author

If DataFrame.mean() and Series.mean() are themselves optimised, what is gained by GroupBy reimplementing them?

Performance. The paths that iterate over groups generally make a sorted copy of the DataFrame to make that iteration faster, but the iteration itself is still costly.

but, if these behaviours only change for subclasses, then performance is only lost for subclasses which override mean etc, in which case surely it is preferable that the intended results are returned rather than a fast one?

@jbrockmendel
Copy link
Member

but, if these behaviours only change for subclasses, then performance is only lost for subclasses which override mean etc, in which case surely it is preferable that the intended results are returned rather than a fast one?

The question was why we use a cython path instead of iterating over DataFrame.mean. I gave an explanation of the status quo, not an argument against the fix suggested here.

@AdamOrmondroyd
Copy link
Contributor Author

but, if these behaviours only change for subclasses, then performance is only lost for subclasses which override mean etc, in which case surely it is preferable that the intended results are returned rather than a fast one?

The question was why we use a cython path instead of iterating over DataFrame.mean. I gave an explanation of the status quo, not an argument against the fix suggested here.

Apologies, I misunderstood the difference you were highlighting.

How do you think I should proceed?

@@ -681,14 +683,20 @@ def value_counts(

index_names = self.grouper.names + [self.obj.name]

constructor_1d = (
self._obj_1d_constructor
if isinstance(self._obj_1d_constructor, Series)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think issubclass rather than isinstance?

@@ -1290,9 +1298,9 @@ def aggregate(self, func=None, *args, engine=None, engine_kwargs=None, **kwargs)
elif relabeling:
# this should be the only (non-raising) case with relabeling
# used reordered index of columns
result = cast(DataFrame, result)
result = cast(self.obj._constructor, result)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these will be wrong if _constructor is not a class

@jbrockmendel
Copy link
Member

How do you think I should proceed?

Unless there is a viable way to handle your use case with an ExtensionArray (i.e. the weights would be somehow included with the values, not at the DataFrame level), then the approach here seems fine (albeit ugly)

@@ -223,6 +223,10 @@ Plotting
- Bug in :meth:`Series.plot` when invoked with ``color=None`` (:issue:`51953`)
-

Groupby
- Bug in :meth:`GroupBy.mean`, :meth:`GroupBy.median`, :meth:`GroupBy.std`, :meth:`GroupBy.var`, :meth:`GroupBy.sem`, :meth:`GroupBy.prod`, :meth:`GroupBy.min`, :meth:`GroupBy.max` don't use corresponding methods of subclasses of :class:`Series` or :class:`DataFrame` (:issue:`51757`)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this belongs below in the Groupby/resample/rolling section?

if not (
type(self.obj).median is Series.median
or type(self.obj).median is DataFrame.median
):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this pattern is going to show up a lot, does it merit a decorator?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've had a first pass at making a decorator, not sure how to deal with engine and engine_kwargs

@mroeschke mroeschke added Groupby Subclassing Subclassing pandas objects labels May 18, 2023
@mroeschke
Copy link
Member

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

@mroeschke mroeschke closed this Aug 11, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Groupby Subclassing Subclassing pandas objects
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: DataFrameSubClass.groupby() doesn't use methods of subclass
6 participants