-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
SubClassedDataFrame.groupby().mean()
etc. use method of SubClassedDataFrame
#51765
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -687,7 +689,7 @@ def value_counts( | |||
# in a backward compatible way | |||
# GH38672 relates to categorical dtype | |||
ser = self.apply( | |||
Series.value_counts, | |||
self._obj_1d_constructor.value_counts, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is trouble bc in general we can't assume that _constructor is a class
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What else could it be (practically speaking, I know it's Callable)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Geopandas has a callable that can dispatch to different classes. @jorisvandenbossche has argued against deprecating allowing this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added a check whether self._obj_1d_constructor
is a Series
I think that this might indeed be an issue for subclasses that don't override those methods. If we want to keep a fix like this, maybe checking whether the method is the one from pandas or not might be better (something like Because with the current change, if you have a subclass that doesn't influence how aggregations are done, you would get a big slowdown because of the current PR I think? |
pandas/core/groupby/generic.py
Outdated
if is_categorical_dtype(val.dtype) or ( | ||
bins is not None and not np.iterable(bins) | ||
): | ||
# scalar bins cannot be done at top level | ||
# in a backward compatible way | ||
# GH38672 relates to categorical dtype | ||
ser = self.apply( | ||
Series.value_counts, | ||
constructor_1d.value_counts, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this also be something like lambda group, **kwargs: group.value_counts(**kwargs)
?
(didn't look in detail at how this code is working, so potentially this doesn't make sense at all)
SubClassedDataFrame.groupby.mean()
etc. use method of SubClassedDataFrame
SubClassedDataFrame.groupby().mean()
etc. use method of SubClassedDataFrame
Yes, our classes add a set of "weights" to the index which are used to calculate statistics such as the mean.
That would function, though feel this would be unexpected extra work for developers, so should be mentioned in the extending pandas section of the docs. I was very surprised that groupby() doesn't just use the methods of its underlying class. If |
Performance. The paths that iterate over groups generally make a sorted copy of the DataFrame to make that iteration faster, but the iteration itself is still costly. |
but, if these behaviours only change for subclasses, then performance is only lost for subclasses which override mean etc, in which case surely it is preferable that the intended results are returned rather than a fast one? |
The question was why we use a cython path instead of iterating over DataFrame.mean. I gave an explanation of the status quo, not an argument against the fix suggested here. |
Apologies, I misunderstood the difference you were highlighting. How do you think I should proceed? |
pandas/core/groupby/generic.py
Outdated
@@ -681,14 +683,20 @@ def value_counts( | |||
|
|||
index_names = self.grouper.names + [self.obj.name] | |||
|
|||
constructor_1d = ( | |||
self._obj_1d_constructor | |||
if isinstance(self._obj_1d_constructor, Series) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think issubclass
rather than isinstance
?
@@ -1290,9 +1298,9 @@ def aggregate(self, func=None, *args, engine=None, engine_kwargs=None, **kwargs) | |||
elif relabeling: | |||
# this should be the only (non-raising) case with relabeling | |||
# used reordered index of columns | |||
result = cast(DataFrame, result) | |||
result = cast(self.obj._constructor, result) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these will be wrong if _constructor is not a class
Unless there is a viable way to handle your use case with an ExtensionArray (i.e. the weights would be somehow included with the values, not at the DataFrame level), then the approach here seems fine (albeit ugly) |
@@ -223,6 +223,10 @@ Plotting | |||
- Bug in :meth:`Series.plot` when invoked with ``color=None`` (:issue:`51953`) | |||
- | |||
|
|||
Groupby | |||
- Bug in :meth:`GroupBy.mean`, :meth:`GroupBy.median`, :meth:`GroupBy.std`, :meth:`GroupBy.var`, :meth:`GroupBy.sem`, :meth:`GroupBy.prod`, :meth:`GroupBy.min`, :meth:`GroupBy.max` don't use corresponding methods of subclasses of :class:`Series` or :class:`DataFrame` (:issue:`51757`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think this belongs below in the Groupby/resample/rolling section?
pandas/core/groupby/groupby.py
Outdated
if not ( | ||
type(self.obj).median is Series.median | ||
or type(self.obj).median is DataFrame.median | ||
): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if this pattern is going to show up a lot, does it merit a decorator?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've had a first pass at making a decorator, not sure how to deal with engine
and engine_kwargs
Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen. |
DataFrameSubClass.groupby()
doesn't use methods of subclass #51757doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.I appreciate that changing the behaviour for all subclasses is overkill, so pointers on how to better do this would be great