Skip to content

Implement take for EA mixins #23159

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 10 commits into from
80 changes: 72 additions & 8 deletions pandas/core/arrays/datetimelike.py
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,8 @@
from pandas._libs.tslibs.period import (
Period, DIFFERENT_FREQ_INDEX, IncompatibleFrequency)

from pandas.errors import NullFrequencyError, PerformanceWarning
from pandas.errors import (
NullFrequencyError, PerformanceWarning, AbstractMethodError)
from pandas import compat

from pandas.tseries import frequencies
Expand All @@ -36,7 +37,7 @@
from pandas.core.dtypes.dtypes import DatetimeTZDtype

import pandas.core.common as com
from pandas.core.algorithms import checked_add_with_arr
from pandas.core.algorithms import checked_add_with_arr, take

from .base import ExtensionOpsMixin
from pandas.util._decorators import deprecate_kwarg
Expand Down Expand Up @@ -77,12 +78,10 @@ class AttributesMixin(object):
@property
def _attributes(self):
# Inheriting subclass should implement _attributes as a list of strings
from pandas.errors import AbstractMethodError
raise AbstractMethodError(self)

@classmethod
def _simple_new(cls, values, **kwargs):
from pandas.errors import AbstractMethodError
raise AbstractMethodError(cls)

def _get_attributes_dict(self):
Expand Down Expand Up @@ -119,7 +118,7 @@ def _box_func(self):
"""
box function to get object from internal representation
"""
raise com.AbstractMethodError(self)
raise AbstractMethodError(self)

def _box_values(self, values):
"""
Expand All @@ -140,6 +139,67 @@ def asi8(self):
# do not cache or you'll create a memory leak
return self.values.view('i8')

# ------------------------------------------------------------------
# Extension Array Interface
# TODO:
# _from_sequence
# _from_factorized
# __setitem__
# _values_for_argsort
# argsort
# fillna
# dropna
# shift
# unique
# _values_for_factorize
# factorize
# _formatting_values
# _reduce
# copy

def _validate_fill_value(self, fill_value):
"""
If a fill_value is passed to `take` convert it to an i8 representation,
raising ValueError if this is not possible.

Parameters
----------
fill_value : object

Returns
-------
fill_value : np.int64

Raises
------
ValueError
"""
raise AbstractMethodError(self)

def take(self, indices, allow_fill=False, fill_value=None):

if allow_fill:
fill_value = self._validate_fill_value(fill_value)

new_values = take(self._data,
indices,
allow_fill=allow_fill,
fill_value=fill_value)

# TODO: use "infer"? Why does not passing freq cause
# failures in py37 but not py27?
freq = self.freq if is_period_dtype(self) else None
return self._shallow_copy(new_values, freq=freq)

@classmethod
def _concat_same_type(cls, to_concat):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not really tested for now, is that correct?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, but I am -1 on moving things out of the PeriodArray PR that cannot stand on its own

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming you're implicitly saying you'd be OK with this if tests were implemented as part of this PR, I think this is an excellent comment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, that is not really what I meant :-)
Because now you added tests for something that would already be tested in the PeriodArray PR. We don't need explicit tests for this method, as it is tested through all existing methods for concat, and it already has explicit tests in the base extension tests.

So I just think this simply belongs in a PR where we actually make PeriodArray an ExtensionArray.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that would already be tested in the PeriodArray PR.

The benefit of doing things in smaller chunks is that each chunk gets more focused. e.g. I don't want to throw shade on Tom's PR, but the take implementation missed fill_value that is a Period with non-matching freq. Doing a couple methods at a time lets both authors and reviewers focus more precisely.

And while "too many tests" does have a downside, it really isn't in the top 5 things I'm going to worry about today.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but the take implementation missed fill_value that is a Period with non-matching freq

And you can review the PR in detail (which of course takes time), and have pointed that out there.

Yes, for sure smaller PRs are better, and to the extent possible, we should strive for that. I fully agree the huge PR makes it very difficult to do a proper review.

But getting out chunks that cannot live on its own can also be hard to review. And in this case it also means to need to completely duplicate tests, and also testing the implementation detail and not the end result.

it really isn't in the top 5 things I'm going to worry about today.

Agreeing on what the end result of this refactor will look like, and having a good view on that, is a worry for me. And so what I am saying is, that many small PRs also has a clear downside of making this harder.

# for TimedeltaArray and PeriodArray; DatetimeArray overrides
freqs = {x.freq for x in to_concat}
assert len(freqs) == 1
freq = list(freqs)[0]
values = np.concatenate([x._data for x in to_concat])
return cls._simple_new(values, freq=freq)

# ------------------------------------------------------------------
# Array-like Methods

Expand Down Expand Up @@ -211,6 +271,10 @@ def astype(self, dtype, copy=True):
# ------------------------------------------------------------------
# Null Handling

def isna(self):
# EA Interface
return self._isnan
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as in the other PRs, I think we should keep this concept of _isnan on the Index (basically my question about it is: for the array classes, what is the advantages of having a _isnan instead of simply using isna() everywhere?)

(but since there are a lot of places here where it is used, I am fine with keeping _isna itself in here until we de the actual split)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a reasonable view, but one that I'd like to put off dealing with for as long as possible.

The eventual move to composition is going to cause a mismatch between what is and is not cacheable (like _isnan is now on the Index subclasses). Since the EA subclasses actually use _isnan in arithmetic/comparison ops, there is a performance hit if we're not careful. Since I haven't thought of a good solution to this, putting it off is the next best thing.


@property # NB: override with cache_readonly in immutable subclasses
def _isnan(self):
""" return if each value is nan"""
Expand Down Expand Up @@ -352,13 +416,13 @@ def _add_datelike(self, other):
typ=type(other).__name__))

def _sub_datelike(self, other):
raise com.AbstractMethodError(self)
raise AbstractMethodError(self)

def _sub_period(self, other):
return NotImplemented

def _add_offset(self, offset):
raise com.AbstractMethodError(self)
raise AbstractMethodError(self)

def _add_delta(self, other):
return NotImplemented
Expand All @@ -380,7 +444,7 @@ def _add_delta_tdi(self, other):
Add a delta of a TimedeltaIndex
return the i8 result view
"""
if not len(self) == len(other):
if len(self) != len(other):
raise ValueError("cannot add indices of unequal length")

if isinstance(other, np.ndarray):
Expand Down
31 changes: 30 additions & 1 deletion pandas/core/arrays/datetimes.py
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,7 @@
conversion, fields, timezones,
resolution as libresolution)

from pandas.util._decorators import cache_readonly
from pandas.util._decorators import cache_readonly, Appender
from pandas.errors import PerformanceWarning
from pandas import compat

Expand Down Expand Up @@ -298,6 +298,35 @@ def _generate_range(cls, start, end, periods, freq, tz=None,

return cls._simple_new(index.values, freq=freq, tz=tz)

# ----------------------------------------------------------------
# Extension Array Interface

@Appender(dtl.DatetimeLikeArrayMixin._validate_fill_value.__doc__)
def _validate_fill_value(self, fill_value):
if isna(fill_value):
fill_value = iNaT
elif isinstance(fill_value, (datetime, np.datetime64)):
self._assert_tzawareness_compat(fill_value)
fill_value = Timestamp(fill_value).value
else:
raise ValueError("'fill_value' should be a Timestamp. "
"Got '{got}'.".format(got=fill_value))
return fill_value

@classmethod
def _concat_same_type(cls, to_concat):
# for TimedeltaArray and PeriodArray; DatetimeArray requires tz
freqs = {x.freq for x in to_concat}
assert len(freqs) == 1
freq = list(freqs)[0]

tzs = {x.tz for x in to_concat}
assert len(tzs) == 1
tz = list(tzs)[0]

values = np.concatenate([x._data for x in to_concat])
return cls._simple_new(values, freq=freq, tz=tz)

# -----------------------------------------------------------------
# Descriptive Properties

Expand Down
20 changes: 19 additions & 1 deletion pandas/core/arrays/period.py
Original file line number Diff line number Diff line change
Expand Up @@ -15,13 +15,14 @@
from pandas._libs.tslibs.fields import isleapyear_arr

from pandas import compat
from pandas.util._decorators import (cache_readonly, deprecate_kwarg)
from pandas.util._decorators import cache_readonly, deprecate_kwarg, Appender

from pandas.core.dtypes.common import (
is_integer_dtype, is_float_dtype, is_period_dtype, is_timedelta64_dtype,
is_datetime64_dtype, _TD_DTYPE)
from pandas.core.dtypes.dtypes import PeriodDtype
from pandas.core.dtypes.generic import ABCSeries
from pandas.core.dtypes.missing import isna

import pandas.core.common as com

Expand Down Expand Up @@ -193,6 +194,23 @@ def _generate_range(cls, start, end, periods, freq, fields):

return subarr, freq

# --------------------------------------------------------------------
# ExtensionArray Interface

@Appender(DatetimeLikeArrayMixin._validate_fill_value.__doc__)
def _validate_fill_value(self, fill_value):
if isna(fill_value):
fill_value = iNaT
elif isinstance(fill_value, Period):
if fill_value.freq != self.freq:
raise ValueError("'fill_value' freq must match own "
"freq ({freq})".format(freq=self.freq))
fill_value = fill_value.ordinal
else:
raise ValueError("'fill_value' should be a Period. "
"Got '{got}'.".format(got=fill_value))
return fill_value

# --------------------------------------------------------------------
# Vectorized analogues of Period properties

Expand Down
16 changes: 16 additions & 0 deletions pandas/core/arrays/timedeltas.py
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,8 @@
from pandas._libs.tslibs.fields import get_timedelta_field
from pandas._libs.tslibs.timedeltas import array_to_timedelta64

from pandas.util._decorators import Appender

from pandas import compat

from pandas.core.dtypes.common import (
Expand Down Expand Up @@ -180,6 +182,20 @@ def _generate_range(cls, start, end, periods, freq, closed=None, **kwargs):

return index

# ----------------------------------------------------------------
# Extension Array Interface

@Appender(dtl.DatetimeLikeArrayMixin._validate_fill_value.__doc__)
def _validate_fill_value(self, fill_value):
if isna(fill_value):
fill_value = iNaT
elif isinstance(fill_value, (timedelta, np.timedelta64, Tick)):
fill_value = Timedelta(fill_value).value
else:
raise ValueError("'fill_value' should be a Timedelta. "
"Got '{got}'.".format(got=fill_value))
return fill_value

# ----------------------------------------------------------------
# Arithmetic Methods

Expand Down
Loading