
[WIP, ENH] Adds cumulative methods to ea #28509


Closed

Conversation

datajanko
Contributor

@datajanko datajanko commented Sep 18, 2019

Problem specific

  • Add abstract tests similar to pandas/tests/extension/base/reduce.py
  • Add TestFixtures for Accumulations
  • Implement Accumulation for IntegerArray
  • Implement Accumulation for DecimalArray - out of scope
  • Implement Accumulation for DatetimeLikeArrayMixin? - out of scope
  • Use BaseNoAccumulateTest where applicable - implemented this for categorical

Contributor

@TomAugspurger TomAugspurger left a comment


Can you add base tests similar to the ones in pandas/tests/extension/base/reduce.py?

Then you'll need to implement _accumulate on

  • IntegerArray
  • DatetimeLikeArrayMixin? (or maybe a subclass?)
  • DecimalArray

What's the return type on ExtensionArray[T]._accumulate? Is it ExtensionArray[T]? Or can you return an ExtensionArray of a different type?

For example, I could imagine a BoolArray.cumsum() that returns an IntegerArray.
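NumPy already behaves this way for booleans, which suggests the interface should allow a dtype change (a plain-NumPy illustration, not the EA implementation):

```python
import numpy as np

# Cumulative sum of booleans counts True values, so the result
# cannot stay boolean -- NumPy upcasts to an integer dtype.
bools = np.array([True, False, True, True])
result = np.cumsum(bools)
print(result)        # [1 1 2 3]
print(result.dtype != np.bool_)  # True -- no longer boolean
```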

@TomAugspurger TomAugspurger added the "ExtensionArray" (Extending pandas with custom dtypes or arrays) and "Numeric Operations" (Arithmetic, Comparison, and Logical operations) labels on Sep 18, 2019
@datajanko
Contributor Author

Sorry, I did not intend to trigger a proper code review; essentially, I just replaced reduce with accumulate and created duplicates. I will take a deeper dive into the consequences soon.

I'll follow your guidance and start implementing the test soon.

As you pointed out, we should often see a different type, ExtensionArray[S]. In addition to your example, cummax could potentially return signed integers, e.g. on an array [1, -1, 1, -1, ...].

@datajanko
Contributor Author

I was expecting that in extension arrays, the _reduce function is used to define e.g. sum, max, etc. However, this does not seem to happen (maybe as of now). So I'm trying to find where an ExtensionArray gets these functions. And I'm only guessing:

I'd assume that an extension array typically is seen as a Series. Here the _reduce function is defined. Moreover, we inherit from base.IndexOpsMixin which defines e.g. max, min etc.

On the other hand, on the Series we perform e.g. _add_numeric_operations (defined for NDFrames, which inherit from pandas objects), which uses the _reduce function of the object.

So what is the right interpretation here? Thanks in advance

@jbrockmendel
Member

I was expecting that in extension arrays, the _reduce function is used to define e.g. sum, max, etc. However, this does not seem to happen (maybe as of now). So I'm trying to find where an ExtensionArray gets these functions.

How to implement this is up to EA authors; e.g. sum is not necessarily well-defined for some EAs (like DatetimeArray), so we can't define it in the general case.

@datajanko
Contributor Author

Thanks for getting back to me. I just realized that e.g. IntegerArray only uses _reduce and does not provide a sum method (which I expected). I hope to have a bit more time soon.

@alimcmaster1
Member

@datajanko - is this still active? Mind addressing the comments above and fixing up the tests? Feel free to post here if you run into any issues!

@datajanko
Contributor Author

datajanko commented Jan 5, 2020

Yes, I'm still working on this. I think so far I managed to create the abstract test classes. Next step is to implement the functionality for integer arrays. Which comments do you mean in particular? Maybe it might even make sense to split this issue into multiple.

@datajanko
Contributor Author

@TomAugspurger I just realized that your comment on the dtypes is crucial. Taking the cumsum of the first 100 integers will directly trigger this issue for Int8 dtypes. So what should be the strategy here?

In terms of cumsum, an implementation could be:

Replace the data at masked positions with 0 (the neutral element of addition), compute the cumsum using NumPy's cumsum, then find the max and min values and determine a suitable underlying NumPy dtype. Finally, create a new array (or change the params of the existing array if possible) and return it.

The same would hold for cumprod (but leaving the values at the masked position to be 1).

cummax and cummin would not have this issue of exceeding the bounds of the dtype
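A minimal NumPy sketch of that strategy (masked_cumsum is a hypothetical helper, not pandas API; the dtype-selection step is omitted):

```python
import numpy as np

def masked_cumsum(values, mask):
    """Cumulative sum over masked data: fill masked slots with 0,
    the neutral element of addition, then accumulate. Masked
    positions stay masked in the result."""
    filled = np.where(mask, 0, values)
    return np.cumsum(filled), mask.copy()

vals = np.array([1, 2, 3, 4], dtype=np.int64)
mask = np.array([False, True, False, False])
out, out_mask = masked_cumsum(vals, mask)
print(out)  # [1 1 4 8] -- position 1 stays masked via out_mask
```

For cumprod the fill value would be 1; for cummin/cummax one would fill with the dtype's max/min respectively.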

@TomAugspurger
Contributor

Replacing masked values with 0 (or 1 for prod) seems reasonable. We'll need to decide if we want to match NumPy's behavior (which silently overflows) or whether we want to catch that. Either is fine by me.

Compute the cumsum (using numpy's cumsum). Now find the max and min value.

I don't know that that'll work, because NumPy silently overflows.
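A small illustration of the overflow NumPy produces when the accumulator dtype is pinned to the input's small integer type:

```python
import numpy as np

arr = np.array([100, 100, 100], dtype=np.int8)
# Forcing the accumulator dtype reproduces the silent wraparound:
overflowed = np.cumsum(arr, dtype=np.int8)
print(overflowed)  # [100 -56 44] -- 200 wraps past int8's max of 127
```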

@datajanko
Contributor Author

datajanko commented Jan 9, 2020

On the _data of an extension array, cumsum does not overflow silently but performs the correct aggregation. For the dtype parameter of the cumsum function, the documentation states:

Type of the returned array and of the accumulator in which the elements are summed. If dtype is not specified, it defaults to the dtype of a, unless a has an integer dtype with a precision less than that of the default platform integer. In that case, the default platform integer is used.

So apparently, an easy implementation would be to always return Int64 dtype, which is not optimal but, I think, a suitable solution at least for the beginning.
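A quick check of the quoted upcasting behavior (the exact result dtype is platform-dependent):

```python
import numpy as np

arr = np.array([100, 100, 100], dtype=np.int8)
# With no dtype argument, cumsum upcasts small integer inputs to the
# default platform integer, so the naive int8 case does not overflow:
result = np.cumsum(arr)
print(result)        # [100 200 300]
print(result.dtype)  # int64 on most platforms (int32 on some)
```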

@datajanko
Contributor Author

Okay, this should be a rough idea in the happy cases. Right now, the tests still fail since I haven't connected ._accumulate(name="cumsum") to the cumsum function of the Series
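To illustrate the missing wiring, here is a toy sketch of how a Series-level cumsum could forward to an array's _accumulate (ToyArray and ToySeries are made-up stand-ins, not the actual pandas internals):

```python
import numpy as np

class ToyArray:
    """Hypothetical stand-in for an ExtensionArray implementing _accumulate."""
    def __init__(self, data):
        self.data = np.asarray(data)

    def _accumulate(self, name, skipna=True):
        # dispatch on the accumulation name, mirroring the proposed interface
        funcs = {
            "cumsum": np.cumsum,
            "cumprod": np.cumprod,
            "cummax": np.maximum.accumulate,
            "cummin": np.minimum.accumulate,
        }
        return ToyArray(funcs[name](self.data))

class ToySeries:
    """Hypothetical Series wrapper; the real dispatch lives in pandas internals."""
    def __init__(self, arr):
        self.arr = arr

    def cumsum(self, skipna=True):
        # the Series-level method only forwards to the array's _accumulate
        return ToySeries(self.arr._accumulate("cumsum", skipna=skipna))

s = ToySeries(ToyArray([1, 2, 3]))
print(s.cumsum().arr.data)  # [1 3 6]
```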

@WillAyd
Member

WillAyd commented Feb 12, 2020

@datajanko is this still active?

@datajanko
Contributor Author

Yes, in fact it is, but unfortunately time is currently scarce for me. It should be better by the end of February. If you want to allocate more/different resources to this issue, I'm of course fine with that.

@WillAyd
Member

WillAyd commented Mar 14, 2020

@datajanko still active? If so can you fix up and try to get green?

@datajanko
Contributor Author

@simonjayhawkins @jreback

Currently, tests for float masked arrays and tests for sparse masked arrays are missing.
Other than that:

I realized that for boolean arrays, the astype method does not currently cast to integer arrays. However, in the _accumulate function for boolean arrays, I'm using this cast. Do you want me to update the implementation of astype here?

Moreover, cumprod is nasty to test due to overflow errors with the given data. Do you have a suggestion on what to do here?
In a previous implementation I just restricted the data (whenever the op was cumprod) to the first 20 entries.

Is the approach okay, or would you rather prefer a different design somewhere?

@datajanko
Contributor Author

datajanko commented Jan 26, 2021

How do you want to scope the ticket?

  • DecimalArray is not there anymore, it seems.
  • It would make sense to add accumulative functions for timedeltas
  • It could make sense to add cummin and cummax for timestamps
  • I think it wouldn't make sense to add cumulative methods to periods (now: probably cummin/cummax makes sense)

A cumsum method is defined for sparse arrays, but not via the _accumulate interface. I didn't see any tests. Do we want to have cumulative functions for sparse arrays?

Currently, we see (just using the standard base tests)

  • cumprod tests failing for integer apparently due to some overflow issues.
  • various cumulative functions failing for floating arrays (all have skipna=True)

Moreover, timestamps and timedeltas seem to implement some cumulative methods via series and specifically:
we see two failing tests

FAILED pandas/tests/series/test_cumulative.py::TestSeriesCumulativeOps::test_cummin_datetime64[US/Pacific]
FAILED pandas/tests/series/test_cumulative.py::TestSeriesCumulativeOps::test_cummax_datetime64[US/Pacific]

Any comments/suggestions on that?

@jbrockmendel
Member

I think it wouldn't make sense to add cumulative methods to periods

Why not cummin/cummax?

------
TypeError : subclass does not define accumulations
"""
raise TypeError(f"cannot perform {name} with type {self.dtype}")
Member


NotImplementedError?

@jbrockmendel
Member

test_cummin_datetime64

you'll need to implement _accumulate on core.arrays.datetimelike.DatetimeLikeArrayMixin

Be careful of NaT handling, IIRC this took some wrangling, see nanops.na_accum_func
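A rough plain-NumPy sketch of that NaT bookkeeping for cummin (illustrative only; nanops.na_accum_func is the authoritative pandas implementation, and cummin_datetimelike is a made-up name):

```python
import numpy as np

NAT = np.iinfo(np.int64).min  # NaT's underlying i8 sentinel value

def cummin_datetimelike(values, skipna=True):
    """NaT-aware cumulative minimum over datetime64 data. NaT is stored
    as the smallest int64, so it must be masked out before accumulating
    or it would win every minimum comparison."""
    i8 = values.view("i8")
    mask = i8 == NAT
    # fill NaT slots with the identity for min so they are skipped
    filled = np.where(mask, np.iinfo(np.int64).max, i8)
    result = np.minimum.accumulate(filled)
    if skipna:
        result[mask] = NAT                         # NaT stays NaT in place
    else:
        result[np.maximum.accumulate(mask)] = NAT  # NaT propagates forward
    return result.view(values.dtype)

vals = np.array(["2020-01-02", "NaT", "2020-01-01"], dtype="datetime64[ns]")
# skipna=True  -> [2020-01-02, NaT, 2020-01-01]
# skipna=False -> [2020-01-02, NaT, NaT]
```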

@datajanko
Contributor Author

I think it wouldn't make sense to add cumulative methods to periods

Why not cummin/cummax?

You're right

@datajanko
Contributor Author

Hey, apparently in my local branch I made some progress implementing _accumulate for datetime-like arrays. However, whenever I use a Series or DataFrame, I fail to call the correct function; in fact, in block_accum_func I only ever enter the nanops branch of the if-else statement.

What is the desired approach here?

I was trying to track down what happens a bit further and found this TODO

  def _split_op_result(self, result) -> List[Block]:
        # See also: split_and_operate
        if is_extension_array_dtype(result) and result.ndim > 1:
            # TODO(EA2D): unnecessary with 2D EAs
            # if we get a 2D ExtensionArray, we need to split it into 1D pieces

It feels, that this is related or am I wrong here?

@jbrockmendel
Member

you can ignore the # TODO(EA2D), that is just trying to quantify the amount of complexity we could avoid if/when we have 2D EAs.

inside block_accum_func, before doing the isinstance(values, ExtensionArray) check, consider calling construction.ensure_wrapped_if_datetimelike

@datajanko
Contributor Author

Awesome, thanks for the hint. I will have a look at this.

@datajanko
Contributor Author

Hey, even though I didn't push, I made some progress and have a somewhat working implementation for the arrays.

However, I observed an inconsistency when handling skipna:

pd.Series([0.0, None, 1.0]).cummax(skipna=False)

gives

0    0.0
1    NaN
2    NaN
dtype: float64

but

pd.Series(pd.array([pd.Timestamp("2020-01-01"), pd.NaT, pd.Timestamp('2020-01-02')])).cummax(skipna=False)

gives

0   2020-01-01
1   2020-01-01
2   2020-01-02
dtype: datetime64[ns]

a different behavior.

Apparently, it would be preferable to have uniform behavior (probably the behavior of the floats). But changing this would be a breaking change, which I guess we want to avoid. Correct?

@datajanko
Contributor Author

datajanko commented Feb 20, 2021

Hey, right now we see the following, where I'd like to request your feedback @jbrockmendel @jreback

  • cumprod fails for skipna=False for float32 and integer arrays, apparently due to overflow. We could restrict the dataset to be a bit smaller, or do you have another suggestion/opinion on what to do?
  • we see the pandas/tests/series/test_cumulative.py failures described in my previous comment. What do we want to do here?
  • in pandas/tests/extension there is no test_timedelta.py. I'm wondering if we want to have accumulation tests also here for datetimelikes?

Apart from that, and except for the typing errors, I guess we are almost done. Or am I missing something?

@jbrockmendel
Member

in pandas/tests/extension there is no test_timedelta.py. I'm wondering if we want to have accumulation tests also here for datetimelikes?

this can either go in a new extensions/test_datetimelike.py file or in arrays/test_datetimelike.py

cumprod fails for skipna=False for float32 and integer arrays, apparently due to overflow. We could restrict the dataset to be a bit smaller, or do you have another suggestion/opinion on what to do?

@jorisvandenbossche ?

@jreback
Contributor

jreback commented Oct 4, 2021

closing as stale, if you want to continue working, please ping.

@jreback jreback closed this Oct 4, 2021
@phofl phofl mentioned this pull request Aug 16, 2022

Successfully merging this pull request may close these issues.

Nullable Int64 column changes type after some (cumsum) operations
8 participants