[WIP, ENH] Adds cumulative methods to ea #28509

Closed
89 commits
2897723
Merge pull request #1 from pandas-dev/master
datajanko Sep 13, 2019
a54e1b4
Merge branch 'master' of https://github.com/pandas-dev/pandas
datajanko Sep 14, 2019
10abc0f
Merge branch 'master' of https://github.com/pandas-dev/pandas
datajanko Sep 15, 2019
c2d7592
define accumulation interface for ExtensionArrays
datajanko Sep 18, 2019
2c149c0
reformulate doc string
datajanko Sep 19, 2019
79cea11
creates baseExtension tests for accumulate
datajanko Sep 19, 2019
12a5ca3
adds fixtures for numeric_accumulations
datajanko Oct 4, 2019
dc959f4
fixes typos
datajanko Nov 13, 2019
bcfb8a8
adds accumulate tests for integer arrays
datajanko Dec 10, 2019
9a8f4ec
fixes typo
datajanko Dec 12, 2019
5d837d9
first implementation of cumsum
datajanko Jan 9, 2020
9e9f0c3
Merge pull request #2 from pandas-dev/master
datajanko Feb 12, 2020
a1a1cb2
Merge pull request #3 from pandas-dev/master
datajanko Mar 15, 2020
6d967ad
merges master
datajanko Mar 15, 2020
73363bf
stashed merge conflict
datajanko Mar 15, 2020
0d9a3d5
fixes formatting
datajanko Mar 15, 2020
84a7d81
first green test for integer extension arrays and cumsum
datajanko Mar 23, 2020
ce6869d
first passing tests for cummin and cummax
datajanko Apr 2, 2020
3b5d1d8
utilizes na_accum_func
datajanko Apr 5, 2020
0337cb0
removes delegation leftover
datajanko Apr 5, 2020
f0722f5
creates running tests
datajanko Apr 9, 2020
99baa1b
Merge branch 'master' into 28385-add-cumulative-methods-to-EA
datajanko Apr 9, 2020
fa35b14
removes ABCExtensionArray Type hint
datajanko Apr 9, 2020
43fca7c
Merge pull request #4 from pandas-dev/master
datajanko Apr 10, 2020
7bd6378
Merge branch 'master' into 28385-add-cumulative-methods-to-EA
datajanko Apr 10, 2020
185510b
removes clutter from generic.py
datajanko Apr 10, 2020
2ef9ebb
removes clutter in _accumulate
datajanko Apr 10, 2020
7d898bd
adds typehints for ExtensionArray and IntegerArray
datajanko Apr 10, 2020
09b42be
delegates the accumulate calls to extension arrays
datajanko Apr 10, 2020
af0dd24
removes diff in nanops
datajanko Apr 10, 2020
bc9a36a
removes unwanted pattern
datajanko Apr 10, 2020
38454a3
makes output types for sum and prod explicit
datajanko Apr 12, 2020
5ecfa51
makes the base accumulate test more general by not comparing types
datajanko Apr 13, 2020
8d62594
implements accumulation for boolean arrays
datajanko Apr 13, 2020
5f3b624
uses f-string in base.py
datajanko Apr 26, 2020
06d1286
uses blockmanager also for extension arrays
datajanko May 2, 2020
7efcb5f
Merge branch 'master' of https://github.com/pandas-dev/pandas
datajanko May 3, 2020
ae5f969
merges master
datajanko May 3, 2020
f7e3f4f
fixes flake8 issues
datajanko May 3, 2020
aa99927
Merge branch 'master' of https://github.com/pandas-dev/pandas
datajanko Jun 15, 2020
9cab6d9
merges master
datajanko Jun 15, 2020
b3ae864
removes uncommented code
datajanko Jun 17, 2020
52e6486
adds todo for runtime warning
datajanko Jun 17, 2020
99fb664
reuses integer array to accumulate for booleans
datajanko Jun 22, 2020
d339250
removes runtimewarning catching
datajanko Jun 22, 2020
be6f974
removes TODOs
datajanko Jun 23, 2020
a902f4e
adds accumulate to autosummary
datajanko Jun 23, 2020
64afb5b
excludes datetime from propagating to _accumulate
datajanko Jun 24, 2020
1e5d77b
uses pandas.testing instead of pandas.util.testing in accumulate
datajanko Jun 29, 2020
c95b490
replaces assert_almost_equal with assert_series_equal
datajanko Jun 30, 2020
dc669de
dtypes to lowercase
datajanko Jun 30, 2020
08475a4
lowercase of uint and int64 dtype in _accumulate
datajanko Jun 30, 2020
67fa99a
uses hint of @simonjayhawkins concerning assert series equals
datajanko Jul 21, 2020
a36632b
Merge branch 'master' into 28385-add-cumulative-methods-to-EA
datajanko Jul 21, 2020
b3d3c81
adds whatsnew entry
datajanko Jul 25, 2020
f8f6367
Merge branch 'master' of https://github.com/pandas-dev/pandas
datajanko Jul 25, 2020
ad6773d
Merge branch 'master' into 28385-add-cumulative-methods-to-EA
datajanko Jul 25, 2020
663c301
Merge branch 'master' of https://github.com/pandas-dev/pandas into ma…
datajanko Aug 10, 2020
e17f3a0
Merge branch 'master' into 28385-add-cumulative-methods-to-EA
datajanko Aug 10, 2020
8cb66f9
moves changes to 1.2.0
datajanko Aug 10, 2020
4f953cf
Merge branch 'master' of https://github.com/pandas-dev/pandas into ma…
datajanko Aug 11, 2020
b33a5df
Merge branch 'master' into 28385-add-cumulative-methods-to-EA
datajanko Aug 11, 2020
18ec178
Merge branch 'master' of https://github.com/pandas-dev/pandas into ma…
datajanko Sep 18, 2020
305bdc7
Merge branch 'master' of https://github.com/pandas-dev/pandas into ma…
datajanko Sep 22, 2020
56bfb23
merges master
datajanko Sep 22, 2020
63db854
merges master
datajanko Oct 31, 2020
9c91c55
Merge branch 'master' of https://github.com/pandas-dev/pandas into ma…
datajanko Oct 31, 2020
6ba3ca9
uses na_accum_func
datajanko Nov 5, 2020
f2a49b3
Merge branch 'master' of https://github.com/pandas-dev/pandas
datajanko Jan 11, 2021
386fa39
merges master
datajanko Jan 12, 2021
55de384
delegate to EAs _accumulate function in block mgr
datajanko Jan 16, 2021
6a5b7f8
moves implementation from nanops to masked_accumulations
datajanko Jan 19, 2021
9c63c64
fixes typing annotations in base and masked
datajanko Jan 21, 2021
84dd141
Merge branch 'master' into 28385-add-cumulative-methods-to-EA
datajanko Jan 22, 2021
2f23499
fixes merge error
datajanko Jan 22, 2021
a5b30e6
fills na values without nanops
datajanko Jan 22, 2021
d22c8a0
fixes incorrect call to cumsum and changes to cumprod
datajanko Jan 25, 2021
a5866c7
add _accumulate to boolean
datajanko Jan 25, 2021
8255457
makes tests a lot easier - cumprod tests still fail
datajanko Jan 25, 2021
483b608
adds BaseNumericAccumulation for floating masked array
datajanko Jan 26, 2021
150fd3b
tests no numeric accumulations according to _accumulate interface
datajanko Jan 26, 2021
80e2dc6
uses NotImplementedError in base accumulate function
datajanko Jan 28, 2021
dceab99
ensures the fill values are data independent
datajanko Feb 16, 2021
1c14f18
adds accumulation for datetimelikes
datajanko Feb 16, 2021
e20501a
Merge branch 'master' of https://github.com/pandas-dev/pandas
datajanko Feb 16, 2021
53147c4
fixes merge conflicts
datajanko Feb 16, 2021
597e978
actually ads datetimelike accumulation algos
datajanko Feb 16, 2021
5ebe8ea
fixes absolute imports
datajanko Feb 16, 2021
32367c0
changes error to catch to adhere to changed implementation
datajanko Feb 20, 2021
1 change: 1 addition & 0 deletions doc/source/reference/extensions.rst
@@ -32,6 +32,7 @@ objects.
.. autosummary::
   :toctree: api/

   api.extensions.ExtensionArray._accumulate
   api.extensions.ExtensionArray._concat_same_type
   api.extensions.ExtensionArray._formatter
   api.extensions.ExtensionArray._from_factorized
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.1.0.rst
@@ -342,6 +342,7 @@ Other enhancements
- ``compute.use_numba`` now exists as a configuration option that utilizes the numba engine when available (:issue:`33966`, :issue:`35374`)
- :meth:`Series.plot` now supports asymmetric error bars. Previously, if :meth:`Series.plot` received a "2xN" array with error values for ``yerr`` and/or ``xerr``, the left/lower values (first row) were mirrored, while the right/upper values (second row) were ignored. Now, the first row represents the left/lower error values and the second row the right/upper error values. (:issue:`9536`)


.. ---------------------------------------------------------------------------

.. _whatsnew_110.notable_bug_fixes:
11 changes: 11 additions & 0 deletions pandas/conftest.py
@@ -1005,6 +1005,17 @@ def all_logical_operators(request):
    return request.param


_all_numeric_accumulations = ["cumsum", "cumprod", "cummin", "cummax"]


@pytest.fixture(params=_all_numeric_accumulations)
def all_numeric_accumulations(request):
    """
    Fixture for numeric accumulation names
    """
    return request.param
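
For orientation, the four names supplied by this fixture correspond to a running sum, product, minimum, and maximum. The following is a minimal standard-library sketch of that mapping; `_BINARY_OPS` and `run_accumulation` are hypothetical names used only for illustration and are not part of pandas or of this PR.

```python
from itertools import accumulate
import operator

# Names mirror the `_all_numeric_accumulations` list in the diff.
ACCUMULATION_NAMES = ["cumsum", "cumprod", "cummin", "cummax"]

# Hypothetical lookup table (not pandas API): the binary operation behind each name.
_BINARY_OPS = {
    "cumsum": operator.add,
    "cumprod": operator.mul,
    "cummin": min,
    "cummax": max,
}


def run_accumulation(name, values):
    """Apply the named accumulation to a plain list of numbers."""
    return list(accumulate(values, _BINARY_OPS[name]))


# One result per name, the way a parametrized test would visit each in turn.
results = {name: run_accumulation(name, [3, 1, 2]) for name in ACCUMULATION_NAMES}
```

A test parametrized over these names can then compare an array's accumulation against a reference computed the same way on plain values.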


# ----------------------------------------------------------------
# Data sets/files
# ----------------------------------------------------------------
69 changes: 69 additions & 0 deletions pandas/core/array_algos/datetimelike_accumulations.py
@@ -0,0 +1,69 @@
from typing import Callable

import numpy as np

from pandas._libs import iNaT

from pandas.core.dtypes.missing import isna

"""
datetimelike_accumulations.py is for accumulations of datetimelike extension arrays
"""


def _cum_func(
    func: Callable,
    values: np.ndarray,
    *,
    skipna: bool = True,
):
    """
    Accumulations for 1D datetimelike arrays.

    Parameters
    ----------
    func : np.cumsum, np.cumprod, np.maximum.accumulate, np.minimum.accumulate
    values : np.ndarray
        Numpy array with the values (can be of any dtype that supports the
        operation).
    skipna : bool, default True
        Whether to skip NA.
    """
    try:
        fill_value = {
            np.cumprod: 1,
            np.maximum.accumulate: np.iinfo(np.int64).min,
            np.cumsum: 0,
            np.minimum.accumulate: np.iinfo(np.int64).max,
        }[func]
    except KeyError:
        raise ValueError(f"No accumulation for {func} implemented on datetimelike arrays")

    mask = isna(values)
    y = values.view("i8")
    y[mask] = fill_value

    if not skipna:
        # This is different compared to the recent implementation for datetimelikes
        # but is the same as the implementation for masked arrays
        mask = np.maximum.accumulate(mask)

    result = func(y)
    result[mask] = iNaT
    return result


def cumsum(values: np.ndarray, *, skipna: bool = True):
    return _cum_func(np.cumsum, values, skipna=skipna)


def cumprod(values: np.ndarray, *, skipna: bool = True):
    return _cum_func(np.cumprod, values, skipna=skipna)


def cummin(values: np.ndarray, *, skipna: bool = True):
    return _cum_func(np.minimum.accumulate, values, skipna=skipna)


def cummax(values: np.ndarray, *, skipna: bool = True):
    return _cum_func(np.maximum.accumulate, values, skipna=skipna)
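
The fill-then-restore approach in `_cum_func` above can be sketched without NumPy or pandas. This is only an illustration under stated assumptions: timestamps are plain Python integers, `NAT` stands in for the missing-value sentinel (assumed here to be the int64 minimum), and `cum_extreme` is a hypothetical helper, not a function from this PR.

```python
from itertools import accumulate

INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1
NAT = INT64_MIN  # assumption for this sketch: missing values use the int64 minimum


def cum_extreme(values, op, skipna=True):
    """Running min or max over integer timestamps; `op` must be `min` or `max`."""
    # Fill missing slots with a value that can never win the comparison.
    fill = INT64_MAX if op is min else INT64_MIN
    mask = [v == NAT for v in values]
    filled = [fill if m else v for v, m in zip(values, mask)]
    if not skipna:
        # Once a missing value is seen, every later position is missing too.
        mask = list(accumulate(mask, lambda seen, m: seen or m))
    result = list(accumulate(filled, op))
    # Write the sentinel back wherever the (possibly propagated) mask is set.
    return [NAT if m else r for r, m in zip(result, mask)]


stamps = [5, NAT, 3, 7]
running_min = cum_extreme(stamps, min)
running_min_strict = cum_extreme(stamps, min, skipna=False)
running_max = cum_extreme(stamps, max)
```

With `skipna=True` only the originally missing position stays missing; with `skipna=False` the accumulated mask marks everything from the first missing value onward.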
78 changes: 78 additions & 0 deletions pandas/core/array_algos/masked_accumulations.py
@@ -0,0 +1,78 @@
from typing import Callable

import numpy as np

from pandas.core.dtypes.common import (
    is_float_dtype,
    is_integer_dtype,
)

"""
masked_accumulations.py is for accumulation algorithms using a mask-based approach
for missing values.
"""


def _cum_func(
    func: Callable,
    values: np.ndarray,
    mask: np.ndarray,
    *,
    skipna: bool = True,
):
    """
    Accumulations for 1D masked array.

    Parameters
    ----------
    func : np.cumsum, np.cumprod, np.maximum.accumulate, np.minimum.accumulate
    values : np.ndarray
        Numpy array with the values (can be of any dtype that supports the
        operation).
    mask : np.ndarray
        Boolean numpy array (True values indicate missing values).
    skipna : bool, default True
        Whether to skip NA.
    """
    dtype_info = None
    if is_float_dtype(values):
        dtype_info = np.finfo(values.dtype.type)
    elif is_integer_dtype(values):
        dtype_info = np.iinfo(values.dtype.type)
    else:
        raise NotImplementedError(
            f"No masked accumulation defined for dtype {values.dtype.type}"
        )
    try:
        fill_value = {
            np.cumprod: 1,
            np.maximum.accumulate: dtype_info.min,
            np.cumsum: 0,
            np.minimum.accumulate: dtype_info.max,
        }[func]
    except KeyError:
        raise ValueError(f"No accumulation for {func} implemented on BaseMaskedArray")

    values[mask] = fill_value

    if not skipna:
        mask = np.maximum.accumulate(mask)

    values = func(values)
    return values, mask


def cumsum(values: np.ndarray, mask: np.ndarray, *, skipna: bool = True):
    return _cum_func(np.cumsum, values, mask, skipna=skipna)


def cumprod(values: np.ndarray, mask: np.ndarray, *, skipna: bool = True):
    return _cum_func(np.cumprod, values, mask, skipna=skipna)


def cummin(values: np.ndarray, mask: np.ndarray, *, skipna: bool = True):
    return _cum_func(np.minimum.accumulate, values, mask, skipna=skipna)


def cummax(values: np.ndarray, mask: np.ndarray, *, skipna: bool = True):
    return _cum_func(np.maximum.accumulate, values, mask, skipna=skipna)
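
A rough sketch of the mask-based idea, using parallel Python lists in place of NumPy arrays. `masked_accumulate` and its `identity` parameter are hypothetical and exist only for this illustration; the PR itself looks the fill value up from the NumPy function.

```python
from itertools import accumulate
import operator


def masked_accumulate(values, mask, op, identity, skipna=True):
    """Return (accumulated values, result mask) for parallel lists."""
    # Replace masked entries with the identity so they do not affect the result.
    filled = [identity if m else v for v, m in zip(values, mask)]
    if skipna:
        out_mask = list(mask)
    else:
        # Propagate the first missing position to everything after it.
        out_mask = list(accumulate(mask, operator.or_))
    return list(accumulate(filled, op)), out_mask


values = [1, 2, 99, 4]  # 99 sits under the mask, so its value is irrelevant
mask = [False, False, True, False]

sums, sums_mask = masked_accumulate(values, mask, operator.add, 0)
prods, _ = masked_accumulate(values, mask, operator.mul, 1)
_, strict_mask = masked_accumulate(values, mask, operator.add, 0, skipna=False)
```

Unlike `values[mask] = fill_value` in the diff, this sketch builds new lists and leaves its inputs untouched.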
35 changes: 34 additions & 1 deletion pandas/core/arrays/base.py
@@ -108,6 +108,7 @@ class ExtensionArray:
take
unique
view
_accumulate
_concat_same_type
_formatter
_from_factorized
@@ -157,8 +158,9 @@ class ExtensionArray:
as they only compose abstract methods. Still, a more efficient
implementation may be available, and these methods can be overridden.

-One can implement methods to handle array reductions.
+One can implement methods to handle array accumulations or reductions.

* _accumulate
* _reduce

One can implement methods to handle parsing from strings that will be used
@@ -1253,6 +1255,37 @@ def _concat_same_type(
# of objects
_can_hold_na = True

    def _accumulate(
        self: ExtensionArray, name: str, *, skipna=True, **kwargs
    ) -> ExtensionArray:
        """
        Return an ExtensionArray performing an accumulation operation.
        The underlying data type might change.

        Parameters
        ----------
        name : str
            Name of the function, supported values are:
            - cummin
            - cummax
            - cumsum
            - cumprod
        skipna : bool, default True
            If True, skip NA values.
        **kwargs
Contributor

May be for numpy compatibility (axis, dtype, out).

Contributor Author

What are your expectations in adding this? Do you purely want to have it (for this ticket) as an accessor, which we'll ignore? If not, I'd guess the impact of axis is None, but dtype might be interesting. I.e., for cumsum and integer dtypes, we could also provide the target output type, so not defaulting to (U)Int64. But I don't know if this should be part of this issue.

Contributor

The idea would be for something like np.cumsum(pd.array([1, 2])) to return an IntegerArray. I'm not sure what all is required for that to work.

            Additional keyword arguments passed to the accumulation function.
            Currently, there is no supported kwarg.

        Returns
        -------
        array

        Raises
        ------
        NotImplementedError : subclass does not define accumulations
        """
        raise NotImplementedError(f"cannot perform {name} with type {self.dtype}")
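
The contract described in this docstring (the base class raises, subclasses opt in by name) might look like the following in miniature. `ToyArray` and `ToySumArray` are hypothetical classes written for this sketch; they are not pandas types and do not reproduce the real `ExtensionArray` interface.

```python
class ToyArray:
    """Hypothetical stand-in for a base array type; not pandas' ExtensionArray."""

    def __init__(self, data):
        self.data = list(data)

    def _accumulate(self, name, *, skipna=True, **kwargs):
        # Default behaviour: no accumulation is defined.
        raise NotImplementedError(
            f"cannot perform {name} with type {type(self).__name__}"
        )


class ToySumArray(ToyArray):
    """Hypothetical subclass that supports only cumsum."""

    def _accumulate(self, name, *, skipna=True, **kwargs):
        if name == "cumsum":
            total, out = 0, []
            for value in self.data:
                total += value
                out.append(total)
            # Return a new array of the same type, mirroring the documented contract.
            return type(self)(out)
        # Anything else falls through to the base-class error.
        return super()._accumulate(name, skipna=skipna, **kwargs)


summed = ToySumArray([1, 2, 3])._accumulate("cumsum").data

try:
    ToySumArray([1, 2, 3])._accumulate("cummax")
    unsupported_raised = False
except NotImplementedError:
    unsupported_raised = True
```

Delegating unsupported names to `super()` keeps the error message in one place, which is the same pattern the `TimedeltaArray` change in this PR follows.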

def _reduce(self, name: str, *, skipna: bool = True, **kwargs):
"""
Return a scalar result of performing the reduction operation.
9 changes: 9 additions & 0 deletions pandas/core/arrays/boolean.py
@@ -712,6 +712,15 @@ def _reduce(self, name: str, *, skipna: bool = True, **kwargs):

        return super()._reduce(name, skipna=skipna, **kwargs)

    def _accumulate(
        self, name: str, *, skipna: bool = True, **kwargs
    ) -> BaseMaskedArray:
        from pandas.core.arrays import IntegerArray

        data = self._data.astype(int)
        mask = self._mask
        return IntegerArray(data, mask)._accumulate(name, skipna=skipna, **kwargs)
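
Casting booleans to integers before accumulating means a cumulative sum counts the `True` values seen so far. A small standard-library sketch of that effect, with plain lists standing in for the array data (missing values are left out here for brevity):

```python
from itertools import accumulate
import operator

flags = [True, False, True, True]

# Cast to integers first, as the diff does with `astype(int)`.
as_ints = [int(flag) for flag in flags]

# Running count of True values.
running_count = list(accumulate(as_ints, operator.add))

# Running product: stays 1 until the first False, then 0 from there on.
running_all = list(accumulate(as_ints, operator.mul))
```

The results are integers rather than booleans, which matches the docstring's note that the underlying data type might change.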

def _maybe_mask_result(self, result, mask, other, op_name: str):
"""
Parameters
17 changes: 17 additions & 0 deletions pandas/core/arrays/datetimelike.py
@@ -93,6 +93,7 @@
isin,
unique1d,
)
from pandas.core.array_algos import datetimelike_accumulations
from pandas.core.arraylike import OpsMixin
from pandas.core.arrays._mixins import (
NDArrayBackedExtensionArray,
@@ -1204,6 +1205,22 @@ def _time_shift(self, periods, freq=None):
# to be passed explicitly.
return self._generate_range(start=start, end=end, periods=None, freq=self.freq)

    def _accumulate(
        self, name: str, *, skipna: bool = True, **kwargs
    ) -> DatetimeLikeArrayT:

        data = self._data.copy()

        if name in {"cummin", "cummax"}:
            op = getattr(datetimelike_accumulations, name)
            data = op(data, skipna=skipna, **kwargs)

            return type(self)._simple_new(data, freq=self.freq, dtype=self.dtype)

        raise NotImplementedError(
            f"Accumulation {name} not implemented for {type(self)}"
        )

@unpack_zerodim_and_defer("__add__")
def __add__(self, other):
other_dtype = getattr(other, "dtype", None)
21 changes: 20 additions & 1 deletion pandas/core/arrays/masked.py
@@ -49,7 +49,10 @@
isin,
take,
)
-from pandas.core.array_algos import masked_reductions
+from pandas.core.array_algos import (
+    masked_accumulations,
+    masked_reductions,
+)
from pandas.core.arraylike import OpsMixin
from pandas.core.arrays import ExtensionArray
from pandas.core.indexers import check_array_indexer
@@ -457,3 +460,19 @@ def _reduce(self, name: str, *, skipna: bool = True, **kwargs):
return libmissing.NA

return result

    def _accumulate(
        self, name: str, *, skipna: bool = True, **kwargs
    ) -> BaseMaskedArray:
        data = self._data
        mask = self._mask

        if name in {"cumsum", "cumprod", "cummin", "cummax"}:
            op = getattr(masked_accumulations, name)
            data, mask = op(data, mask, skipna=skipna, **kwargs)

            return type(self)(data, mask, copy=False)

        raise NotImplementedError(
            f"Accumulation {name} not implemented for BaseMaskedArray"
        )
19 changes: 19 additions & 0 deletions pandas/core/arrays/timedeltas.py
@@ -57,6 +57,7 @@

from pandas.core import nanops
from pandas.core.algorithms import checked_add_with_arr
from pandas.core.array_algos import datetimelike_accumulations
from pandas.core.arrays import (
IntegerArray,
datetimelike as dtl,
@@ -403,6 +404,24 @@ def std(
return self._box_func(result)
return self._from_backing_data(result)

# ----------------------------------------------------------------
# Accumulations

    def _accumulate(
        self, name: str, *, skipna: bool = True, **kwargs
    ) -> TimedeltaArray:

        data = self._data.copy()

        if name == "cumsum":
            op = getattr(datetimelike_accumulations, name)
            data = op(data, skipna=skipna, **kwargs)

            return type(self)._simple_new(data, freq=None, dtype=self.dtype)

        else:
            return super()._accumulate(name, skipna=skipna, **kwargs)

# ----------------------------------------------------------------
# Rendering Methods

9 changes: 8 additions & 1 deletion pandas/core/generic.py
@@ -10320,7 +10320,14 @@ def _accum_func(self, name: str, func, axis=None, skipna=True, *args, **kwargs):
         def block_accum_func(blk_values):
             values = blk_values.T if hasattr(blk_values, "T") else blk_values
 
-            result = nanops.na_accum_func(values, func, skipna=skipna)
+            from pandas.core.construction import ensure_wrapped_if_datetimelike
+
+            values = ensure_wrapped_if_datetimelike(values)
+
+            if isinstance(values, ExtensionArray):
+                result = values._accumulate(name, skipna=skipna, **kwargs)
+            else:
+                result = nanops.na_accum_func(values, func, skipna=skipna)
 
             result = result.T if hasattr(result, "T") else result
             return result
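
The dispatch in this hunk (use the array's own `_accumulate` when it is an extension type, otherwise fall back to the generic routine) can be sketched in isolation. `ToyExtension`, `generic_cumsum`, and `block_accum` are hypothetical names for this illustration and only mimic the shape of the change, not pandas internals.

```python
from itertools import accumulate
import operator


class ToyExtension:
    """Hypothetical array type that brings its own accumulation."""

    def __init__(self, data):
        self.data = list(data)

    def _accumulate(self, name, *, skipna=True):
        if name != "cumsum":
            raise NotImplementedError(name)
        return ToyExtension(accumulate(self.data, operator.add))


def generic_cumsum(values):
    """Hypothetical fallback for plain sequences."""
    return list(accumulate(values, operator.add))


def block_accum(values, name="cumsum", skipna=True):
    # Prefer the array's own implementation when it is the special type.
    if isinstance(values, ToyExtension):
        return values._accumulate(name, skipna=skipna)
    return generic_cumsum(values)


from_extension = block_accum(ToyExtension([1, 2, 3]))
from_plain = block_accum([1, 2, 3])
```

Both paths produce the same numbers, but the extension path returns the array's own type, so any type-specific handling of missing values stays with the array.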
4 changes: 4 additions & 0 deletions pandas/tests/extension/base/__init__.py
@@ -41,6 +41,10 @@ class TestMyDtype(BaseDtypeTests):
``assert_series_equal`` on your base test class.

"""
from pandas.tests.extension.base.accumulate import (  # noqa
    BaseNoAccumulateTests,
    BaseNumericAccumulateTests,
)
from pandas.tests.extension.base.casting import BaseCastingTests # noqa
from pandas.tests.extension.base.constructors import BaseConstructorsTests # noqa
from pandas.tests.extension.base.dtype import BaseDtypeTests # noqa