Skip to content

ENH: Handle extension arrays in algorithms.diff #31025

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 23 commits into from
Jan 23, 2020
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
Show all changes
23 commits
Select commit Hold shift + click to select a range
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions doc/source/whatsnew/v1.0.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -998,6 +998,8 @@ Numeric
- Bug in :meth:`DataFrame.round` where a :class:`DataFrame` with a :class:`CategoricalIndex` of :class:`IntervalIndex` columns would incorrectly raise a ``TypeError`` (:issue:`30063`)
- Bug in :meth:`Series.pct_change` and :meth:`DataFrame.pct_change` when there are duplicated indices (:issue:`30463`)
- Bug in :class:`DataFrame` cumulative operations (e.g. cumsum, cummax) incorrect casting to object-dtype (:issue:`19296`)
- Bug in :class:`~DataFrame.diff` losing the dtype for extension types (:issue:`30889`)
- Bug in :class:`DataFrame.diff` raising an ``IndexError`` when one of the columns was a nullable integer dtype (:issue:`30967`)

Conversion
^^^^^^^^^^
Expand Down
3 changes: 3 additions & 0 deletions pandas/core/algorithms.py
Original file line number Diff line number Diff line change
Expand Up @@ -1834,6 +1834,9 @@ def diff(arr, n: int, axis: int = 0):
na = np.nan
dtype = arr.dtype

if is_extension_array_dtype(dtype):
return arr.diff(n)

is_timedelta = False
is_bool = False
if needs_i8_conversion(arr):
Expand Down
5 changes: 5 additions & 0 deletions pandas/core/arrays/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -578,6 +578,11 @@ def dropna(self):
"""
return self[~self.isna()]

def diff(self, periods: int = 1):
if hasattr(self, "__sub__"):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not thrilled about this, but it may be unavoidable. Happy to hear of an alternative.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But if it doesn't support -, the operation will also raise a TypeError? What you are doing below yourself?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. I meant the need for the hasattr just to please the type checker. If you do a diff on an EA that doens't implement __sub__ you'll get the TypeError either way.

return self - self.shift(periods)
raise TypeError()

def shift(self, periods: int = 1, fill_value: object = None) -> ABCExtensionArray:
"""
Shift values by desired number.
Expand Down
6 changes: 5 additions & 1 deletion pandas/core/arrays/numpy_.py
Original file line number Diff line number Diff line change
Expand Up @@ -15,7 +15,7 @@
from pandas.core.dtypes.missing import isna

from pandas import compat
from pandas.core import nanops
from pandas.core import algorithms, nanops
from pandas.core.algorithms import searchsorted, take, unique
from pandas.core.arrays.base import ExtensionArray, ExtensionOpsMixin
import pandas.core.common as com
Expand Down Expand Up @@ -164,6 +164,10 @@ def _from_sequence(cls, scalars, dtype=None, copy=False):
result = result.copy()
return cls(result)

def diff(self, periods: int = 1):
result = algorithms.diff(com.values_from_object(self._ndarray), periods)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

com.values_from_object was the old behavior of Series.diff. So this should be identical to the old behavior, just with an extra function call to go through self.array.diff.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need to override the base class version? Doesn't that work as well for numpy?

return type(self)(result)

@classmethod
def _from_factorized(cls, values, original):
return cls(values)
Expand Down
8 changes: 8 additions & 0 deletions pandas/core/internals/blocks.py
Original file line number Diff line number Diff line change
Expand Up @@ -1962,6 +1962,14 @@ class ObjectValuesExtensionBlock(ExtensionBlock):
Series[T].values is an ndarray of objects.
"""

def diff(self, n: int, axis: int = 1) -> List["Block"]:
# Block.shape vs. Block.values.shape mismatch
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More technical debt from the 1D arrays inside 2D blocks :(

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, and this is on ObjectValuesExtensionBlock, but is only useful for PeriodArray. IntervalArray is the only other array to use this, and doesn't implement __sub__. Is it worth creating a PeriodBlock just for this (IMO no)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this only needed for ObjectValuesExtensionBlock, and not for ExtensionBlock?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose that in principle, we can hit this from ExtensionBlock.

We hit the problem when going from a NonConsolidatable block type (like period) to a consolidatable one (like object). In that case, the values passed to make_block will be 1d, but the block is expecting them to be 2d.

In practice, I think that for most EAs, ExtensionArray.__sub__ will return another extension array. So while ExtensionBlock.diff isn't 100% correct for all EAs, I think trying to handle this case is not worth it. What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So while ExtensionBlock.diff isn't 100% correct for all EAs, I think trying to handle this case is not worth it

It certainly seems fine to ignore this corner case for now.

# Do the op, get the object-dtype ndarray, and reshape
# to put into an ObjectBlock
new_values = algos.diff(self.values, n, axis=axis)
new_values = np.atleast_2d(new_values)
return [self.make_block(values=new_values)]

def external_values(self, dtype=None):
return self.values.astype(object)

Expand Down
2 changes: 1 addition & 1 deletion pandas/core/series.py
Original file line number Diff line number Diff line change
Expand Up @@ -2316,7 +2316,7 @@ def diff(self, periods=1) -> "Series":
5 NaN
dtype: float64
"""
result = algorithms.diff(com.values_from_object(self), periods)
result = algorithms.diff(self.array, periods)
return self._constructor(result, index=self.index).__finalize__(self)

def autocorr(self, lag=1) -> float:
Expand Down
22 changes: 22 additions & 0 deletions pandas/tests/extension/base/methods.py
Original file line number Diff line number Diff line change
Expand Up @@ -231,6 +231,28 @@ def test_container_shift(self, data, frame, periods, indices):

compare(result, expected)

@pytest.mark.parametrize("periods", [1, -2])
def test_diff(self, data, periods):
data = data[:5]
try:
# does this array implement ops?
data - data
except Exception:
pytest.skip(f"{type(data)} does not support diff")
s = pd.Series(data)
result = s.diff(periods)
expected = pd.Series(data - data.shift(periods))
self.assert_series_equal(result, expected)

df = pd.DataFrame({"A": data, "B": [1.0] * 5})
result = df.diff(periods)
if periods == 1:
b = [np.nan, 0, 0, 0, 0]
else:
b = [0, 0, 0, np.nan, np.nan]
expected = pd.DataFrame({"A": expected, "B": b})
tm.assert_frame_equal(result, expected)

@pytest.mark.parametrize(
"periods, indices",
[[-4, [-1, -1]], [-1, [1, -1]], [0, [0, 1]], [1, [-1, 0]], [4, [-1, -1]]],
Expand Down