-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
REF: let EAs override WrappedCythonOp groupby implementations #51166
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
changed groupby_op to _groupby_op and added a docstring making explicit that this API is experimental. Open to formatting suggestions for the kwargs part of the docstring. |
Thinking about the namespace question whether to have a single dispatching method (_groupby_op) vs a bunch (_grouped_sum, _grouped_std, ...). With the single-method (this PR), we have Maybe a middle ground with e.g. _grouped_reduce, _grouped_transform, _grouped_quantile, _grouped_rank so that each of them can have their kwargs made explicit? cc @phofl @mroeschke @jorisvandenbossche |
Is there actually a demand from ExtensionArray authors to be able to implement the groupby reductions? (I mean, I can certainly see that it could be useful for certain kind of extension dtypes, but I would still be wary to implement it if it's not really requested) |
No, but I think there is a chicken/egg issue in that some libraries (i have in mind cudf, dask, modin) haven't implemented EAs at all and this is a necessary step to making EAs a viable option for them.
That is definitely a motivator. This is much nicer abstraction-wise. We agreed long ago that ideally our internal EAs would be treated symmetrically with 3rd party EAs. |
@MarcoGorelli this is the one i mentioned |
@jorisvandenbossche feedback in #51471 seems mostly positive. any thoughts? |
@MarcoGorelli gentle ping |
+1 here. to the extent this cleans up generic pandas code this is quite valuable (not to mention that EA authors can use this). |
@mroeschke gentle ping, this is the way forward for e.g. #51996 |
I get this solves #51996 in the short term, but my only reservation here is the dependence on the existing grouping machinery (the determination of This also applies probably to other dataframe libraries like cuDF/Dask/etc that we hope could utilize the EA interface, but groupby is a "frame-wise" operation where this concern also applies. |
Aside from the 3rd-party concerns, our own EA-specific logic fits much better here than it does in groupby.ops. I've changed this to use a private name to somewhat safeguard against future changes. |
Okay I see the benefit for our EAs currently. Just hoping that you are open to a future API where 3rd party EAs (including pyarrow) could also dictate the grouping + op steps together. |
Absolutely open to it. I'd love to rip out a bunch of our libgroupby code one day. |
Apologies, was off last week (though still did some minor things). I've made it a priority to look at this next week |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No objections, generally looks good, but I'm not familiar enough with this part of the code to say whether this should be the way to go
if isinstance(self.dtype, StringDtype): | ||
dtype = self.dtype | ||
string_array_cls = dtype.construct_array_type() | ||
return string_array_cls._from_sequence(res_values, dtype=dtype) | ||
|
||
else: | ||
raise NotImplementedError |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hasn't this if-then already been done above? or is it just to future-proof the code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
future-proof is a nice way of putting it, yes. this is transplanted from its current position in WrappedCythonOp where the redundant checks are in separate methods
I had https://github.com/pandas-dev/pandas/pull/51166/files#r1139482980 but otherwise looked good |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be fine, looks like the only changes since I last reviewed are minor mypy fixups and merges from upstream/main
op = WrappedCythonOp(how=how, kind=kind, has_dropped_na=has_dropped_na) | ||
|
||
# libgroupby functions are responsible for NOT altering mask | ||
mask = self._mask |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably add tests covering this at some point. Not sure if we already have them
thx @jbrockmendel |
Goal is to allow EAs (including our own pyarrow-backed) to implement performant GroupBy reductions/transforms without necessarily having to convert to numpy.
Analogous to #51003 (quantile) and #51116 (any, all, std).
This implements EA.grouby_op which takes a "how" keyword to specify which reduction/transform is being done. By contrast, #51003 implements EA.groupby_quantile and #51116 implements EA.groupby_std and EA.groupby_any_all. If we go this direction (we should), we should choose one naming scheme for all these methods instead of three slightly different ones. i.e. either a) shove everything into something like
groupby_op
or b) explicitly havegroupby_foo
for each of the relevant methods.Also will need docs before moving forward.