-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
[WIP, ENH] Adds cumulative methods to ea #28509
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP, ENH] Adds cumulative methods to ea #28509
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add base tests similar to the ones in pandas/tests/extension/base/reduce.py
?
Then you'll need to implement _accumulate
on
- IntegerArray
- DatetimeLikeArrayMixin? (or maybe a subclass?)
- DecimalArray
What's the return type on ExtensionArray[T]._accumulate
? Is it ExtensionArray[T]
? Or can you return an ExtensionArray of a different type?
For example, I could imagine a BoolArray.cumsum()
that returns an IntegerArray.
Sorry, I did not intend to have a proper code review, essentially, I just replaced reduce with accumulate and created duplicates. Will have a deeper dive into the consequences soon. I'll follow your guidance and start implementing the test soon. As you pointed out, we should often see a different Type |
I was expecting that in extension arrays, the I'd assume that an extension array typically is seen as a Series. Here the On the other hand, on the Series, we perform e.g. So what is the right interpretation here? Thanks in advance |
This is up to EA authors how to implement. e.g. |
Thanks for getting back to me. I just realized, that e.g. IntegerArray only uses |
@datajanko - is this still active? Mind addressing the comments above and fix up the tests. Feel free to post on here if you run into any issues! |
Yes, I'm still working on this. I think so far I managed to create the abstract test classes. Next step is to implement the functionality for integer arrays. Which comments do you mean in particular? Maybe it might even make sense to split this issue into multiple. |
@TomAugspurger I just realized that your comment on the dtypes is crucial. Doing the cumsum on the first 100 digits, will directly raise this issue for Int8 dtypes. So what should be the strategy here? In terms of cumsum, an implementation could be: replace the data at masked positions with 0 (neutral element of addition). Compute the cumsum (using numpy's cumsum). Now find the max and min value. And determine an underlying numpy type that is suitable. Now create a new array (or change the params of the existing array if possible) and return this. The same would hold for cumprod (but leaving the values at the masked position to be 1). cummax and cummin would not have this issue of exceeding the bounds of the dtype |
Replacing masked values with 0 (or 1 for prod) seems reasonable. We'll need to decide if we want to match NumPy's behavior (which silently overflows) or whether we want to catch that. Either is fine by me.
I don't know that that'll work, because NumPy silently overflows. |
On the _data of an extension array, cumsum does not flow over silently, but does the correct aggregation. For the dtype parameter in the cumsum function, the documentation states:
So apparently, an easy implementation would be to always return |
Okay, this should be a rough idea in the happy cases. Right now, the tests still fail since I haven't connected |
@datajanko is this still active? |
Yes in fact it is, but unfortunately time is scarce for me currently. Should be better by the end of February. If you want to allocate more/different resources on this issue, I'm of course fine with that. |
Updates fork
@datajanko still active? If so can you fix up and try to get green? |
updates from upstream
Currently, tests for float masked arrays and test for sparse masked arrays are missing. I realized that in boolean arrays, the Moreover, cumprod is nasty to test due to overflow errors with the given data. Do you have a suggestion on what to do here? Is the approach okay, or would you rather prefer a different design somewhere? |
How do you want to scope the ticket?
A Currently, we see (just using the standard base tests)
Moreover, timestamps and timedeltas seem to implement some cumulative methods via series and specifically:
Any comments/suggestions on that? |
Why not cummin/cummax? |
pandas/core/arrays/base.py
Outdated
------ | ||
TypeError : subclass does not define accumulations | ||
""" | ||
raise TypeError(f"cannot perform {name} with type {self.dtype}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NotImplementedError?
you'll need to implement _accumulate on core.arrays.datetimelike.DatetimeLikeArrayMixin Be careful of NaT handling, IIRC this took some wrangling, see nanops.na_accum_func |
You're right |
Hey, apparently, in my local branch I made some progress when implementing What is the desired approach here? I was trying to track down what happens a bit further and found this TODO
It feels, that this is related or am I wrong here? |
you can ignore the inside block_accum_func, before doing the |
Awesome, thanks for the hint. I will have a look at this. |
Hey, even though I didn't push, I guess I made some progress and have a somewhat working implementation of the arrays. However, I observed an inconsistency when handling skipna:
gives
but
gives
a different behavior. Apparently, it would be preferable to have uniform behavior (probably the behavior of the floats). But changing this, would also mean a breaking change. I guess, that we want to avoid this. Correct? |
additionally, remove min_count as irrellevant
in generic.py ensure that datetimelikes are wrapped create a twin of masked_accumulations for datetimelikes timedeltas also allow cumsum and cumprod, theoretically
Hey, right now we see the following, where I'd like to request your feedback @jbrockmendel @jreback
Apart from that, and except for the typing errors, I guess we are almost done. Or am I missing something? |
this can either go in a new extensions/test_datetimelike.py file or in arrays/test_datetimelike.yp
|
closing as stale, if you want to continue working, please ping. |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
Problem specific