ENH: Add cumulative methods to ea #48111

Merged

mroeschke merged 129 commits into pandas-dev:main from phofl:28385-add-cumulative-methods-to-EA

Dec 13, 2022

Member

phofl commented Aug 16, 2022

closes #28385 (Nullable Int64 column changes type after some (cumsum) operations)
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

This tries to finish #28509. I've reduced the scope to masked arrays and would prefer to do datetimelike as a follow-up, if we want to change the behavior there. This is currently a bit buggy, as mentioned in #28509 (comment).
For datetimelike arrays we currently dispatch back to the existing implementation.

I was wondering about the exact interface. Currently, you'd have to call _accumulate with the name of the function, since cumsum etc. are not really added to the interface, meaning

arr = pd.array([1, 2, pd.NA], dtype="Int64")
arr.cumsum()

raises, because the method is not registered. Is this intended?
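To make the interface question concrete, here is a minimal sketch of name-based dispatch. `ToyArray` and its `_ACCUMULATIONS` table are hypothetical and only mirror the shape described above (an `_accumulate` method that takes the operation name as a string); this is not the pandas implementation, and `None` stands in for `pd.NA`.

```python
import operator
from itertools import accumulate


class ToyArray:
    """Hypothetical array type used for illustration only (not pandas code)."""

    # Operation name -> binary function that gets accumulated.
    _ACCUMULATIONS = {
        "cumsum": operator.add,
        "cumprod": operator.mul,
        "cummin": min,
        "cummax": max,
    }

    def __init__(self, values):
        self.values = list(values)  # None marks a missing entry

    def _accumulate(self, name):
        try:
            func = self._ACCUMULATIONS[name]
        except KeyError:
            raise NotImplementedError(f"cannot perform {name}") from None
        # Accumulate over the present values only.
        running = accumulate((v for v in self.values if v is not None), func)
        # Missing entries stay missing; the others receive the running result.
        return ToyArray(None if v is None else next(running) for v in self.values)


arr = ToyArray([1, 2, None])
result = arr._accumulate("cumsum").values
has_public_cumsum = hasattr(arr, "cumsum")
```

In this sketch `arr._accumulate("cumsum")` works, while `arr.cumsum()` would raise `AttributeError` because no public method of that name exists on the class, which is the situation the question above is about.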

datajanko and others added 30 commits

September 13, 2019 20:19


          Merge pull request #1 from pandas-dev/master

updates pandas


          Merge branch 'master' of https://github.com/pandas-dev/pandas

a54e1b4


          Merge branch 'master' of https://github.com/pandas-dev/pandas

10abc0f


          define accumulation interface for ExtensionArrays

c2d7592


          reformulate doc string

2c149c0


          creates baseExtension tests for accumulate

79cea11


          adds fixtures for numeric_accumulations

12a5ca3


          fixes typos

dc959f4


          adds accumulate tests for integer arrays

bcfb8a8


          fixes typo

9a8f4ec


          first implementation of cumsum

5d837d9


          Merge pull request #2 from pandas-dev/master

9e9f0c3

Updates fork


          Merge pull request #3 from pandas-dev/master

a1a1cb2

updates from upstream


          merges master

6d967ad


          stashed merge conflict

73363bf


          fixes formatting

0d9a3d5


          first green test for integer extension arrays and cumsum

84a7d81


          first passing tests for cummin and cummax

ce6869d


          utilizes na_accum_func

3b5d1d8


          removes delegation leftover

0337cb0


          creates running tests

f0722f5


          Merge branch 'master' into 28385-add-cumulative-methods-to-EA

99baa1b


          removes ABCExtensionArray Type hint

fa35b14


          Merge pull request #4 from pandas-dev/master

43fca7c

merges upstream


          Merge branch 'master' into 28385-add-cumulative-methods-to-EA

7bd6378


          removes clutter from generic.py

185510b


          removes clutter in _accumulate

2ef9ebb


          adds typehints for ExtensionArray and IntegerArray

7d898bd


          delegates the accumulate calls to extension arrays

09b42be


          removes diff in nanops

af0dd24

jbrockmendel reviewed

pandas/core/arrays/datetimelike.py

jbrockmendel reviewed

pandas/core/arrays/datetimelike.py

+                      data = self._data.copy()
+                      if name in {"cummin", "cummax"}:
+                          func = np.minimum.accumulate if name == "cummin" else np.maximum.accumulate

Member

jbrockmendel Nov 24, 2022

do the numpy functions not work directly?

Member Author

phofl Nov 29, 2022

This behaves differently from going through nanops. The initial PR included this, but I ripped it out to keep this one more focused. I plan to tackle it afterwards, but we have to decide what we actually want here first. Hence only masked ops are done here.

Member

jbrockmendel Nov 30, 2022

This has a different behavior than going through nanops

can you expand on this?

Member Author

phofl Nov 30, 2022

skipna is the problem: the behavior differs with regard to floats. I remembered this incorrectly.

See #28509 (comment)

I wanted to investigate this as a follow-up, when we can check it in isolation.
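As a rough illustration of why `skipna` matters here, the sketch below contrasts a running minimum that propagates NaN (the way `np.minimum` does for floats) with one that skips missing values. Both helper functions are hypothetical, written in plain Python, and are not the nanops code path; the exact difference phofl had in mind is in the linked comment.

```python
import math
from itertools import accumulate


def propagating_min(a, b):
    # Mimics np.minimum for floats: the result is NaN if either input is NaN.
    if math.isnan(a) or math.isnan(b):
        return math.nan
    return min(a, b)


def cummin_plain(values):
    """Running minimum with no NaN handling: NaN poisons all later results."""
    return list(accumulate(values, propagating_min))


def cummin_skipna(values):
    """Running minimum that skips NaN; NaN slots stay NaN in the output."""
    out, running = [], None
    for v in values:
        if math.isnan(v):
            out.append(math.nan)
            continue
        running = v if running is None else min(running, v)
        out.append(running)
    return out


data = [3.0, math.nan, 1.0, 2.0]
plain = cummin_plain(data)      # NaN from index 1 onwards
skipped = cummin_skipna(data)   # NaN only at index 1
```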

jbrockmendel reviewed

pandas/core/arrays/datetimelike.py

jbrockmendel reviewed

pandas/core/arrays/timedeltas.py

jbrockmendel reviewed

pandas/core/generic.py

jbrockmendel reviewed

pandas/core/generic.py

phofl added 5 commits

November 29, 2022 21:37


          Move to top of file

a6a974a


          Change error

e7364bd


          Change _data

4ff6e4d


          Remove

57abcc3


          Merge remote-tracking branch 'upstream/main' into 28385-add-cumulativ…

1b3771e

…e-methods-to-EA

# Conflicts:
#	pandas/core/arrays/datetimelike.py

Member Author

phofl commented Nov 29, 2022

Should have addressed everything

jbrockmendel reviewed

pandas/core/arrays/datetimelike.py

jbrockmendel reviewed

pandas/core/arrays/datetimelike.py

+                      if name in {"cummin", "cummax"}:
+                          func = np.minimum.accumulate if name == "cummin" else np.maximum.accumulate
+                          result = cast(np.ndarray, nanops.na_accum_func(data, func, skipna=skipna))

Member

jbrockmendel Nov 30, 2022

this looks like it might choke on PeriodDtype?

Member Author

phofl Nov 30, 2022

Seems to work, could you elaborate what you suspect?

import pandas as pd

ser = pd.Series([pd.Period('2012-1-1', freq='D'), pd.Period('2013-1-1', freq='D')])
ser.cummin()

Member

jbrockmendel Nov 30, 2022

i think this goes wrong when there's a NaT present

Member Author

phofl Nov 30, 2022

This would restore the previous behavior (but the previous behavior was wrong as well...). Are you ok with addressing this in a follow up?

Member

jbrockmendel Dec 5, 2022

This would restore the previous behavior

previous as of when? IIRC this was last changed multiple years ago.

but the previous behavior was wrong as well

so does this PR get the behavior right for dt64/td64? If so, the solution for PeriodDtype is similar to median/min/max to do a view to dt64, do the op, then view back.
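One plausible reading of the NaT concern (an assumption, not something confirmed in this thread): pandas' int64-backed datetimelike data stores NaT as the smallest int64, so a running minimum over the raw integers treats NaT as the smallest value. The sketch below uses plain Python lists with made-up day ordinals; both helper names are hypothetical.

```python
INT64_MIN = -(2**63)  # the integer that represents NaT in int64-backed data


def naive_cummin(ordinals):
    """Running minimum over raw integers, unaware of the NaT sentinel."""
    out = []
    for value in ordinals:
        out.append(value if not out else min(out[-1], value))
    return out


def nat_aware_cummin(ordinals):
    """Running minimum that keeps NaT slots as NaT and otherwise skips them."""
    out, running = [], None
    for value in ordinals:
        if value == INT64_MIN:
            out.append(INT64_MIN)
            continue
        running = value if running is None else min(running, value)
        out.append(running)
    return out


ordinals = [15706, INT64_MIN, 15340]  # illustrative ordinals with a NaT in the middle
naive = naive_cummin(ordinals)        # the sentinel wins from index 1 onwards
aware = nat_aware_cummin(ordinals)    # only index 1 is NaT
```

Doing the operation on a datetime64 view, as suggested above, would reuse NaT handling that already exists for that dtype instead of reimplementing it for periods.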

Member Author

phofl Dec 7, 2022 (edited)

Sorry, my explanation was a bit confusing.

The initial pr which this work is based on tried the following:

add accumulate to the ea interface
implement masked-based accumulators
implement new accumulation logic for date time and related arrays

This got confusing, since it was pretty big and the new date time logic changed the behavior. So I made the decision to reduce scope and only tackle the first 2 points and defer the third to a follow up.

To achieve this, I just send the datetime logic through nanops again (this is what I meant by previous behavior: previous as in before _accumulate was added to the EA interface). This avoids any change in behavior for the datetime arrays and should keep the review more focused.
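For the second point, a mask-based accumulator can be sketched as follows. This is a simplified illustration under assumptions: the function name `masked_cumsum` and the list-based values/mask pair are hypothetical, and the real code operates on NumPy arrays.

```python
import operator
from itertools import accumulate


def masked_cumsum(values, mask, *, skipna=True):
    """Cumulative sum over (values, mask); True in mask marks a missing entry."""
    # Fill masked slots with 0, the identity for addition, so they do not
    # contribute to the running sum.
    filled = [0 if m else v for v, m in zip(values, mask)]
    if skipna:
        # Only the originally missing positions are missing in the result.
        out_mask = list(mask)
    else:
        # Once a missing value is seen, every later result is missing too.
        out_mask = list(accumulate(mask, operator.or_))
    return list(accumulate(filled)), out_mask


values = [1, 2, 0, 4]
mask = [False, False, True, False]
summed, kept_mask = masked_cumsum(values, mask)
_, propagated_mask = masked_cumsum(values, mask, skipna=False)
```

The integer values never need to be cast to float to hold a missing marker, because missingness lives entirely in the mask.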

Member

jbrockmendel Dec 12, 2022

Let's try this one more time, thanks for your patience in explaining this to me.

IIUC this PR should not change any behavior for dt64/dt64tz/td64 dtypes, correct?

Member Author

phofl Dec 12, 2022 (edited)

Yes, correct. Want to do those changes as follow ups

Member

jbrockmendel Dec 12, 2022

Cool, I'll trust you to handle these.

Member Author

phofl Dec 12, 2022

Thx, yep want to get this into 2.0 as well

jbrockmendel reviewed

pandas/tests/extension/base/accumulate.py

phofl added 6 commits

November 30, 2022 22:01


          Add todo


          Fix typo

c770872


          Adjust var

cb7277b


          Special case

797e724


          Merge remote-tracking branch 'upstream/main' into 28385-add-cumulativ…

4f8b06a

…e-methods-to-EA


          Fix tests

ab3cf7e

jbrockmendel reviewed

pandas/tests/extension/base/accumulate.py

		self.assert_series_equal(result, expected, check_dtype=False)


		class BaseNoAccumulateTests(BaseAccumulateTests):

Member

jbrockmendel Dec 12, 2022

i find this pattern really weird (xref #44742), is there a way to do this with just one class?

Member Author

phofl Dec 12, 2022 (edited)

Combined the classes. Have to overwrite the tests that should not get executed now.
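The single-class pattern described here might look roughly like the sketch below. All class and method names are hypothetical and the tests are deliberately trivial; the point is only that one base class carries both the "works" and the "raises" test, and each subclass overrides the one that does not apply.

```python
from itertools import accumulate


class BaseAccumulateTests:
    """Hypothetical shared test class (illustrative names, not the pandas suite)."""

    def make_data(self):
        raise NotImplementedError

    def do_accumulate(self, data, name):
        raise NotImplementedError

    def test_accumulate_works(self):
        assert self.do_accumulate(self.make_data(), "cumsum") == [1, 3, 6]

    def test_accumulate_raises(self):
        try:
            self.do_accumulate(self.make_data(), "cumsum")
        except NotImplementedError:
            return
        raise AssertionError("expected NotImplementedError")


class TestWithAccumulation(BaseAccumulateTests):
    def make_data(self):
        return [1, 2, 3]

    def do_accumulate(self, data, name):
        return list(accumulate(data))

    def test_accumulate_raises(self):
        pass  # overridden: this array type supports accumulation


class TestWithoutAccumulation(BaseAccumulateTests):
    def make_data(self):
        return ["a", "b"]

    def do_accumulate(self, data, name):
        raise NotImplementedError(name)

    def test_accumulate_works(self):
        pass  # overridden: this array type has no accumulation
```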

Member

jbrockmendel Dec 12, 2022

thanks


          Combine classes

e1d2a4e

jbrockmendel approved these changes

Member

jbrockmendel left a comment

LGTM pending green

phofl and others added 2 commits

December 12, 2022 20:24


          Merge branch 'main' into 28385-add-cumulative-methods-to-EA

53eac54


          Fix mypy

e7dbd5f

mroeschke merged commit b5953aa into pandas-dev:main

Member

mroeschke commented Dec 13, 2022

Thanks @phofl

phofl deleted the 28385-add-cumulative-methods-to-EA branch

December 13, 2022 22:32

dhimmel mentioned this pull request

Not working with extension types has2k1/plotnine#529

Closed

Labels

NA - MaskedArrays Reduction Operations