
[WIP, ENH] Adds cumulative methods to ea #28509


Closed

Conversation

datajanko
Contributor

@datajanko datajanko commented Sep 18, 2019

Problem specific

  • Add abstract tests similar to pandas/tests/extension/base/reduce.py
  • Add TestFixtures for Accumulations
  • Implement Accumulation for IntegerArray
  • Implement Accumulation for DecimalArray - out of scope
  • Implement Accumulation for DatetimeLikeArrayMixin? - out of scope
  • Use BaseNoAccumulateTest where applicable - implemented this for categorical

Contributor

@TomAugspurger TomAugspurger left a comment


Can you add base tests similar to the ones in pandas/tests/extension/base/reduce.py?

Then you'll need to implement _accumulate on

  • IntegerArray
  • DatetimeLikeArrayMixin? (or maybe a subclass?)
  • DecimalArray

What's the return type on ExtensionArray[T]._accumulate? Is it ExtensionArray[T]? Or can you return an ExtensionArray of a different type?

For example, I could imagine a BoolArray.cumsum() that returns an IntegerArray.
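NumPy already behaves this way for booleans, which suggests the interface should allow a dtype change (a plain-NumPy illustration, not the EA implementation):

```python
import numpy as np

# Cumulative sum of booleans counts True values, so the result
# cannot stay boolean -- NumPy upcasts to an integer dtype.
bools = np.array([True, False, True, True])
result = np.cumsum(bools)
print(result)        # [1 1 2 3]
print(result.dtype != np.bool_)  # True -- no longer boolean
```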

@TomAugspurger TomAugspurger added the "ExtensionArray" (Extending pandas with custom dtypes or arrays) and "Numeric Operations" (Arithmetic, Comparison, and Logical operations) labels on Sep 18, 2019
@datajanko
Contributor Author

Sorry, I did not intend to trigger a proper code review; essentially, I just replaced reduce with accumulate and created duplicates. I will take a deeper dive into the consequences soon.

I'll follow your guidance and start implementing the test soon.

As you pointed out, we should often see a different type, ExtensionArray[S]. In addition to your example, cummax could potentially return signed integers, e.g. on an array [1, -1, 1, -1, ...].

@datajanko
Contributor Author

I was expecting that in extension arrays, the _reduce function is used to define e.g. sum, max, etc. However, this does not seem to happen (maybe as of now). So I'm trying to find where an ExtensionArray gets these functions. And I'm only guessing:

I'd assume that an extension array typically is seen as a Series. Here the _reduce function is defined. Moreover, we inherit from base.IndexOpsMixin which defines e.g. max, min etc.

On the other hand, on the Series we perform e.g. _add_numeric_operations (defined for NDFrames, which inherit from pandas objects), which uses the _reduce function of the object.

So what is the right interpretation here? Thanks in advance

@jbrockmendel
Member

I was expecting that in extension arrays, the _reduce function is used to define e.g. sum, max, etc. However, this does not seem to happen (maybe as of now). So I'm trying to find where an ExtensionArray gets these functions.

How to implement this is up to EA authors; e.g. sum is not necessarily well-defined for some EAs (like DatetimeArray), so we can't define it in the general case.

@datajanko
Contributor Author

Thanks for getting back to me. I just realized that e.g. IntegerArray only uses _reduce and does not provide a sum method (which I expected). I hope to have a bit more time soon.

@alimcmaster1
Member

@datajanko - is this still active? Mind addressing the comments above and fixing up the tests? Feel free to post here if you run into any issues!

@datajanko
Contributor Author

datajanko commented Jan 5, 2020

Yes, I'm still working on this. I think so far I managed to create the abstract test classes. Next step is to implement the functionality for integer arrays. Which comments do you mean in particular? Maybe it might even make sense to split this issue into multiple.

@datajanko
Contributor Author

@TomAugspurger I just realized that your comment on the dtypes is crucial. Taking the cumsum of the first 100 integers will directly trigger this issue for Int8 dtypes. So what should be the strategy here?

In terms of cumsum, an implementation could be:

Replace the data at masked positions with 0 (the neutral element of addition), compute the cumsum using NumPy's cumsum, then find the max and min values and determine a suitable underlying NumPy dtype. Finally, create a new array (or change the params of the existing array if possible) and return it.

The same would hold for cumprod (but leaving the values at the masked position to be 1).

cummax and cummin would not have this issue of exceeding the bounds of the dtype
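A minimal NumPy sketch of that strategy (masked_cumsum is a hypothetical helper, not pandas API; the dtype-selection step is omitted):

```python
import numpy as np

def masked_cumsum(values, mask):
    """Cumulative sum over masked data: fill masked slots with 0,
    the neutral element of addition, then accumulate. Masked
    positions stay masked in the result."""
    filled = np.where(mask, 0, values)
    return np.cumsum(filled), mask.copy()

vals = np.array([1, 2, 3, 4], dtype=np.int64)
mask = np.array([False, True, False, False])
out, out_mask = masked_cumsum(vals, mask)
print(out)  # [1 1 4 8] -- position 1 stays masked via out_mask
```

For cumprod the fill value would be 1; for cummin/cummax one would fill with the dtype's max/min respectively.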

@TomAugspurger
Contributor

Replacing masked values with 0 (or 1 for prod) seems reasonable. We'll need to decide if we want to match NumPy's behavior (which silently overflows) or whether we want to catch that. Either is fine by me.

Compute the cumsum (using numpy's cumsum). Now find the max and min value.

I don't know that that'll work, because NumPy silently overflows.
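A small illustration of the overflow NumPy produces when the accumulator dtype is pinned to the input's small integer type:

```python
import numpy as np

arr = np.array([100, 100, 100], dtype=np.int8)
# Forcing the accumulator dtype reproduces the silent wraparound:
overflowed = np.cumsum(arr, dtype=np.int8)
print(overflowed)  # [100 -56 44] -- 200 wraps past int8's max of 127
```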

@datajanko
Contributor Author

datajanko commented Jan 9, 2020

On the _data of an extension array, cumsum does not overflow silently but performs the correct aggregation. For the dtype parameter of the cumsum function, the documentation states:

Type of the returned array and of the accumulator in which the elements are summed. If dtype is not specified, it defaults to the dtype of a, unless a has an integer dtype with a precision less than that of the default platform integer. In that case, the default platform integer is used.

So apparently, an easy implementation would be to always return Int64 dtype, which is not optimal but, I think, a suitable solution at least for the beginning.
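A quick check of the quoted upcasting behavior (the exact result dtype is platform-dependent):

```python
import numpy as np

arr = np.array([100, 100, 100], dtype=np.int8)
# With no dtype argument, cumsum upcasts small integer inputs to the
# default platform integer, so the naive int8 case does not overflow:
result = np.cumsum(arr)
print(result)        # [100 200 300]
print(result.dtype)  # int64 on most platforms (int32 on some)
```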

@datajanko
Contributor Author

Okay, this should be a rough idea in the happy cases. Right now, the tests still fail since I haven't connected ._accumulate(name="cumsum") to the cumsum function of the Series
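To illustrate the missing wiring, here is a toy sketch of how a Series-level cumsum could forward to an array's _accumulate (ToyArray and ToySeries are made-up stand-ins, not the actual pandas internals):

```python
import numpy as np

class ToyArray:
    """Hypothetical stand-in for an ExtensionArray implementing _accumulate."""
    def __init__(self, data):
        self.data = np.asarray(data)

    def _accumulate(self, name, skipna=True):
        # dispatch on the accumulation name, mirroring the proposed interface
        funcs = {
            "cumsum": np.cumsum,
            "cumprod": np.cumprod,
            "cummax": np.maximum.accumulate,
            "cummin": np.minimum.accumulate,
        }
        return ToyArray(funcs[name](self.data))

class ToySeries:
    """Hypothetical Series wrapper; the real dispatch lives in pandas internals."""
    def __init__(self, arr):
        self.arr = arr

    def cumsum(self, skipna=True):
        # the Series-level method only forwards to the array's _accumulate
        return ToySeries(self.arr._accumulate("cumsum", skipna=skipna))

s = ToySeries(ToyArray([1, 2, 3]))
print(s.cumsum().arr.data)  # [1 3 6]
```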

@WillAyd
Member

WillAyd commented Feb 12, 2020

@datajanko is this still active?

@datajanko
Contributor Author

Yes, in fact it is, but unfortunately time is currently scarce for me. It should be better by the end of February. If you want to allocate more/different resources to this issue, I'm of course fine with that.

@WillAyd
Member

WillAyd commented Mar 14, 2020

@datajanko still active? If so can you fix up and try to get green?

@datajanko
Contributor Author

@simonjayhawkins @jreback

Currently, tests for float masked arrays and tests for sparse masked arrays are missing.
Other than that:

I realized that for boolean arrays, the astype method does not currently cast to integer arrays. However, in the _accumulate function for boolean arrays, I'm using this cast. Do you want me to update the implementation of astype here?

Moreover, cumprod is nasty to test due to overflow errors with the given data. Do you have a suggestion on what to do here?
In a previous implementation I just restricted the data (whenever the op was cumprod) to the first 20 entries.

Is the approach okay, or would you rather prefer a different design somewhere?

@datajanko
Contributor Author

datajanko commented Jan 26, 2021

How do you want to scope the ticket?

  • DecimalArray is not there anymore, it seems.
  • It would make sense to add accumulative functions for timedeltas
  • It could make sense to add cummin and cummax for timestamps
  • I think it wouldn't make sense to add cumulative methods to periods (now: probably cummin/cummax makes sense)

A cumsum method is defined for sparse arrays, but not via the _accumulate interface. I didn't see any tests. Do we want to have cumulative functions for sparse arrays?

Currently, we see (just using the standard base tests)

  • cumprod tests failing for integer apparently due to some overflow issues.
  • various cumulative functions failing for floating arrays (all have skipna=True)

Moreover, timestamps and timedeltas seem to implement some cumulative methods via series and specifically:
we see two failing tests

FAILED pandas/tests/series/test_cumulative.py::TestSeriesCumulativeOps::test_cummin_datetime64[US/Pacific]
FAILED pandas/tests/series/test_cumulative.py::TestSeriesCumulativeOps::test_cummax_datetime64[US/Pacific]

Any comments/suggestions on that?

@jbrockmendel
Member

I think it wouldn't make sense to add cumulative methods to periods

Why not cummin/cummax?

------
TypeError : subclass does not define accumulations
"""
raise TypeError(f"cannot perform {name} with type {self.dtype}")
Member


NotImplementedError?

@jbrockmendel
Member

test_cummin_datetime64

you'll need to implement _accumulate on core.arrays.datetimelike.DatetimeLikeArrayMixin

Be careful of NaT handling, IIRC this took some wrangling, see nanops.na_accum_func
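A rough plain-NumPy sketch of that NaT bookkeeping for cummin (illustrative only; nanops.na_accum_func is the authoritative pandas implementation, and cummin_datetimelike is a made-up name):

```python
import numpy as np

NAT = np.iinfo(np.int64).min  # NaT's underlying i8 sentinel value

def cummin_datetimelike(values, skipna=True):
    """NaT-aware cumulative minimum over datetime64 data. NaT is stored
    as the smallest int64, so it must be masked out before accumulating
    or it would win every minimum comparison."""
    i8 = values.view("i8")
    mask = i8 == NAT
    # fill NaT slots with the identity for min so they are skipped
    filled = np.where(mask, np.iinfo(np.int64).max, i8)
    result = np.minimum.accumulate(filled)
    if skipna:
        result[mask] = NAT                         # NaT stays NaT in place
    else:
        result[np.maximum.accumulate(mask)] = NAT  # NaT propagates forward
    return result.view(values.dtype)

vals = np.array(["2020-01-02", "NaT", "2020-01-01"], dtype="datetime64[ns]")
# skipna=True  -> [2020-01-02, NaT, 2020-01-01]
# skipna=False -> [2020-01-02, NaT, NaT]
```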

@datajanko
Contributor Author

I think it wouldn't make sense to add cumulative methods to periods

Why not cummin/cummax?

You're right

@datajanko
Contributor Author

Hey, apparently in my local branch I made some progress implementing _accumulate for datetime-like arrays. However, whenever I use a Series or DataFrame, I fail to call the correct function; in fact, in block_accum_func I only ever enter the nanops branch of the if-else statement.

What is the desired approach here?

I was trying to track down what happens a bit further and found this TODO

  def _split_op_result(self, result) -> List[Block]:
        # See also: split_and_operate
        if is_extension_array_dtype(result) and result.ndim > 1:
            # TODO(EA2D): unnecessary with 2D EAs
            # if we get a 2D ExtensionArray, we need to split it into 1D pieces

It feels, that this is related or am I wrong here?

@jbrockmendel
Member

you can ignore the # TODO(EA2D), that is just trying to quantify the amount of complexity we could avoid if/when we have 2D EAs.

inside block_accum_func, before doing the isinstance(values, ExtensionArray) check, consider calling construction.ensure_wrapped_if_datetimelike

@datajanko
Contributor Author

Awesome, thanks for the hint. I will have a look at this.

@datajanko
Contributor Author

Hey, even though I didn't push, I made some progress and have a somewhat working implementation for the arrays.

However, I observed an inconsistency when handling skipna:

pd.Series([0.0, None, 1.0]).cummax(skipna=False)

gives

0    0.0
1    NaN
2    NaN
dtype: float64

but

pd.Series(pd.array([pd.Timestamp("2020-01-01"), pd.NaT, pd.Timestamp('2020-01-02')])).cummax(skipna=False)

gives

0   2020-01-01
1   2020-01-01
2   2020-01-02
dtype: datetime64[ns]

a different behavior.

Apparently, it would be preferable to have uniform behavior (probably the behavior of the floats). But changing this would be a breaking change, which I guess we want to avoid. Correct?

@datajanko
Contributor Author

datajanko commented Feb 20, 2021

Hey, right now we see the following, where I'd like to request your feedback @jbrockmendel @jreback

  • cumprod fails for skipna=False for float32 and integer arrays, apparently due to overflow. We could restrict the dataset to be a bit smaller, or do you have another suggestion/opinion on what to do?
  • we see the pandas/tests/series/test_cumulative.py failures described in my previous comment. What do we want to do here?
  • in pandas/tests/extension there is no test_timedelta.py. I'm wondering if we want to have accumulation tests also here for datetimelikes?

Apart from that, and except for the typing errors, I guess we are almost done. Or am I missing something?

@jbrockmendel
Member

in pandas/tests/extension there is no test_timedelta.py. I'm wondering if we want to have accumulation tests also here for datetimelikes?

this can either go in a new extensions/test_datetimelike.py file or in arrays/test_datetimelike.py

cumprod fails for skipna=False for float32 and integer arrays, apparently due to overflow. We could restrict the dataset to be a bit smaller, or do you have another suggestion/opinion on what to do?

@jorisvandenbossche ?

@jreback
Contributor

jreback commented Oct 4, 2021

closing as stale, if you want to continue working, please ping.

@jreback jreback closed this Oct 4, 2021
@phofl phofl mentioned this pull request Aug 16, 2022

Successfully merging this pull request may close these issues.

Nullable Int64 column changes type after some (cumsum) operations
8 participants