Skip to content

API: Dispatch mechanism for EA reductions #33790

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
dsaxton opened this issue Apr 25, 2020 · 5 comments
Closed

API: Dispatch mechanism for EA reductions #33790

dsaxton opened this issue Apr 25, 2020 · 5 comments
Labels
API Design Enhancement ExtensionArray Extending pandas with custom dtypes or arrays. Needs Discussion Requires discussion from core team before further action

Comments

@dsaxton
Copy link
Member

dsaxton commented Apr 25, 2020

#33538 implemented sum as a standalone method for IntegerArray which somewhat duplicates the already-existing _reduce path taken by DataFrame / Series for certain reductions (see #33538 (review)). Likely it makes sense to implement these at the IntegerArray / BooleanArray / etc. level using that same approach, e.g., something like defining each reduction to simply call _reduce with the appropriate operation name instead of copy / pasting the logic into each method.

@dsaxton dsaxton added Needs Triage Issue that has not been reviewed by a pandas team member Usage Question API Design Needs Discussion Requires discussion from core team before further action and removed Needs Triage Issue that has not been reviewed by a pandas team member Usage Question labels Apr 25, 2020
@jorisvandenbossche jorisvandenbossche added the ExtensionArray Extending pandas with custom dtypes or arrays. label Apr 29, 2020
@jorisvandenbossche
Copy link
Member

So some questions we need to discuss on this topic:

  • Do we want to add the user-facing reduction methods to the Array classes? (like we now added IntegerArray.sum and StringArray.min/max
    • They are not necessarily needed internally in pandas (unless we change the EA interface for reductions, see below), but could be nice for usability of the Array classes.
    • Adding them can give some basic compatibility with numpy (eg np.sum(int_array)), although this could also be done by adding support for numpy protocols (__array_funcion__))
  • The current EA interface for reduction operations is that an EA needs to implement EA._reduce(op_name, ...), and the Series/DataFrame reduction methods will eventually call this.
    • Are we happy with this interface? Or do we want to change it by letting Series/DataFrame call the actual operation-specific methods?
  • If we keep the EA._reduce interface, that means that adding the actual methods is optional ? Meaning, the EA author can choose themselves whether they add a sum(), min(), .. methods to the EA? (or eg only add those that makes sense for the data type)
    • This means that internally, we should not rely on those methods being present.
    • Alternatively, we could also add those methods to the ExtensionArray base class, so those are automaticaly (and consistently) added to all EA classes. But, this would also add methods to an EA class that potentially don't make any sense (eg adding kurt and skew to non-numeric EAs).
  • If we are going to add more of those methods on our own EAs (regardless of doing that on the base class or not, see question above), how do we want to do this:
    • We implement in _reduce the actual logic for all reductions, and let sum(), min() etc dispatch to _reduce
    • We implement sum(), min(), etc and let _reduce call those (which makes _reduce basically a getattr(self, name)(**kwargs) or raise TypeError if the method is not available on the class).
    • Or implement them both independently, which is what we basically did now for IntegerArray.sum (and since, in that case, both simply call to masked_reductions.sum(..), it's also not much duplication, or longer than calling _reduce)

cc @jbrockmendel @TomAugspurger @jreback

@jreback
Copy link
Contributor

jreback commented Apr 29, 2020

  • I would ideally / eventually switch the dispatch mechanism for Series to __array_function__ ` (ideally because this will take a transition period). This will enable us to treat EA and numpy arrays on the same footing (and avoid special casing)
  • We already have some Mixins to EA that support numeric ops. So we should generically define these to use the aggregation mechansim (__array_function__ / _reduce). We should add another Mixin that only supports min / max (e.g. non-numeric reductions). I guess sum is 'special' in that it is enabled (plus min/max for strings), leave this as a special case.
  • For the actual implementation. I would define the common numeric aggregations generically on the base EA numeric mixin (sum,min,max,median,var,std,mean); these should contain the doc-strings. These should simply dispatch to _array_function__ / _reduce as the entry point for actual computation (e.g. to generically override all operations in a sub-class). Of course a sub-class is also free to override these individually if it makes sense. This way the impl for a sub-class is simple (1 method).

This is different that what we have now (we look for an override method first in _reduce then operate generically), but I think this doesn't make sense as far as user facing methods, e.g. we need a method somewhere to put the doc-string.

This pattern is very similar to what we do elsewhere (e.g. we mostly dispatch to nanops to do the actual computation with Series / Index level methods).

@jorisvandenbossche
Copy link
Member

[on using __array_function__] This will enable us to treat EA and numpy arrays on the same footing (and avoid special casing)

Although I think it is a good idea to implement those numpy protocols for compatibility, I am not fully sure if we can actually use it internally, since we have additional keywords compared to numpy (skipna, min_count).

Having something similar as ExtensionOpsMixin for the reduction ops (or add it to the existing mixin) sounds good to me. Then in principle, the EAs that want those reduction ops can simply use the mixin and only implement _reduce themselves.

@rhshadrach
Copy link
Member

rhshadrach commented Nov 20, 2021

#44442 added a cast because BooleanArray does not currently have a sum attribute. This cast can be removed on the resolution of this issue.

@jbrockmendel
Copy link
Member

We now have a basic ExtensionArray._reduce method and both Series._reduce and DataFrame._reduce dispatch to it. Closing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Enhancement ExtensionArray Extending pandas with custom dtypes or arrays. Needs Discussion Requires discussion from core team before further action
Projects
None yet
Development

No branches or pull requests

6 participants