-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: Add IntervalDtype support to IntervalIndex.astype #19231
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -710,7 +710,8 @@ def __eq__(self, other): | |||
# None should match any subtype | |||
return True | |||
else: | |||
return self.subtype == other.subtype | |||
from pandas.core.dtypes.common import is_dtype_equal | |||
return is_dtype_equal(self.subtype, other.subtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change was necessary since some subtypes don't nicely compare:
In [2]: dtype1 = IntervalDtype('float64')
In [3]: dtype2 = IntervalDtype('datetime64[ns, US/Eastern]')
In [4]: dtype1 == dtype2
---------------------------------------------------------------------------
TypeError: data type not understood
Though interestingly enough the reverse comparison was fine:
In [5]: dtype2 == dtype1
Out[5]: False
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes this is because numpy types don't handle comparisons vs pandas dtypes.
@@ -710,7 +710,8 @@ def __eq__(self, other): | |||
# None should match any subtype | |||
return True | |||
else: | |||
return self.subtype == other.subtype | |||
from pandas.core.dtypes.common import is_dtype_equal | |||
return is_dtype_equal(self.subtype, other.subtype) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes this is because numpy types don't handle comparisons vs pandas dtypes.
if is_interval_dtype(dtype): | ||
return self.copy() if copy else self | ||
dtype = pandas_dtype(dtype) | ||
if is_interval_dtype(dtype) and dtype != self.dtype: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you could do: not is_dtype(equal(dtype, self.dtype)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this needs to be a bit more specific than not is_dtype_equal
; what this is really looking for is an IntervalDtype
with a different subtype than the existing dtype. Just having not is_dtype_equal
would allow non-interval dtypes to pass through as well.
I could make the "different subtype" part more explicit with and not is_dtype_equal(dtype.subtype, self.dtype.subtype)
. Though this would be a partial dupe of the IntervalDtype.__eq__
logic. Could also just add a comment noting the "different subtype" part, if that would be preferable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I c, this is ok then. these are guaranteed to be II types and thus they are comparable
index.astype(dtype) | ||
|
||
# float64 -> integer-like fails with non-integer valued floats | ||
index = interval_range(0.0, 10.0, freq=0.25) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
numpy actually allows things like this (IOW truncation of floats). Are you failing this directly?
I agree that it should fail though, its just a can of worms.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is test is being xfailed with @pytest.mark.xfail(reason='GH 15832')
. I think fixing #15832 should resolve this, since the changes should flow through to IntervalIndex
as well, since it's relying on Float64Index
to hold the endpoints.
The current behavior within this PR is truncation of floats, identical to the existing Float64Index
behavior:
In [2]: ii = pd.interval_range(0.25, 1.5, freq=0.25)
In [3]: ii
Out[3]:
IntervalIndex([(0.25, 0.5], (0.5, 0.75], (0.75, 1.0], (1.0, 1.25], (1.25, 1.5]]
closed='right',
dtype='interval[float64]')
In [4]: ii.astype('interval[int64]')
Out[4]:
IntervalIndex([(0, 0], (0, 0], (0, 1], (1, 1], (1, 1]]
closed='right',
dtype='interval[int64]')
In [5]: pd.Float64Index([1.1, 2.2, 3.3]).astype('int64')
Out[5]: Int64Index([1, 2, 3], dtype='int64')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok that's fine
closed=index.closed) | ||
tm.assert_index_equal(result, expected) | ||
|
||
@pytest.mark.parametrize('subtype_start, subtype_end', [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add some tests where we fail on int/unit sub-types that we don't support e.g. (int8,16,32)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we have these for construction as well (tests I mean)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There aren't any construction related tests for unsupported int/uint. This also doesn't fail with how I've implemented IntervalIndex.astype
in this PR, since it relies on the behavior of the underlying endpoint indexes. I can't find a way in which those raise when passed an unsupported int/uint, as they always appear to upcast to int64
:
In [2]: pd.Index([1, 2, 3], dtype='int16')
Out[2]: Int64Index([1, 2, 3], dtype='int64')
In [3]: pd.Int64Index([1, 2, 3]).astype('int8')
Out[3]: Int64Index([1, 2, 3], dtype='int64')
In [4]: pd.Index(np.arange(5, dtype='int16'))
Out[4]: Int64Index([0, 1, 2, 3, 4], dtype='int64')
In [5]: pd.Int64Index([1, 2, 3]).astype('uint8')
Out[5]: UInt64Index([1, 2, 3], dtype='uint64')
Seems like this would need to fail at the level shown above for it to fail on IntervalIndex
? If so, probably best done in a separate PR, since my initial guess is that it'd require a bit of work/changes. I could write xfailing tests here though. Or am I misinterpreting what you mean?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right these are all valid (as these are how indexes are constructed), maybe just add some tests to confirm for II
Codecov Report
@@ Coverage Diff @@
## master #19231 +/- ##
==========================================
- Coverage 91.55% 91.53% -0.02%
==========================================
Files 147 147
Lines 48797 48805 +8
==========================================
- Hits 44675 44673 -2
- Misses 4122 4132 +10
Continue to review full report at Codecov.
|
thanks @jschendel nice patch! |
git diff upstream/master -u -- "*.py" | flake8 --diff
Summary:
IntervalIndex.astype(IntervalDtype(...))
interval[int64]
-->interval[float64]
/interval/test_astype.py
file for the tests/datetimes
, and/timedeltas
astype
related tests forIntervalIndex
closed
, since it doesn't really come into play forastype
, though each variation of closed is accounted for among the indexes testedIntervalDtype
class when an invalid subtype comparison is made