Skip to content

IntervalDtype inconsistencies and bugs #18980

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jschendel opened this issue Dec 28, 2017 · 1 comment · Fixed by #18997
Closed

IntervalDtype inconsistencies and bugs #18980

jschendel opened this issue Dec 28, 2017 · 1 comment · Fixed by #18997
Labels
Categorical Categorical Data Type Compat pandas objects compatability with Numpy or Python functions Dtype Conversions Unexpected or buggy dtype conversions Interval Interval data type
Milestone

Comments

@jschendel
Copy link
Member

jschendel commented Dec 28, 2017

1. Inconsistent comparisons versus string 'interval':

In [2]: IntervalDtype() == 'interval'
Out[2]: True

In [3]: IntervalDtype('interval') == 'interval'
Out[3]: False

In [4]: IntervalDtype('int64') == 'interval'
Out[4]: False

I'd expect all of these to return True, like how CategoricalDtype(*, *) == 'category' always returns True.


2. Inconsistent comparisons versus IntervalDtype(None):

In [5]: IntervalDtype(None) == IntervalDtype('interval')
Out[5]: False

In [6]: IntervalDtype(None) == IntervalDtype('int64')
Out[6]: False

I'd expect all of these to return True, like how CDT(None, None) == CDT(*, *) always returns True.


3. IntervalDtype.name attribute changes

In [7]: IntervalDtype().name
Out[7]: 'interval'

In [8]: IntervalDtype('interval').name
Out[8]: 'interval[]'

In [9]: IntervalDtype('int64').name
Out[9]: 'interval[int64]'

CategoricalDtype.name attribute is always the same:

In [10]: CategoricalDtype(list('abc'), True).name
Out[10]: 'category'

In [11]: CategoricalDtype(list('wxyz'), False).name
Out[11]: 'category'

I'd expect IntervalDtype.name to always return 'interval', like how CDT.name always returns 'category'. This makes the code for checking equality against strings (i.e. what I described in 1) simpler. I don't think the behavior of str(IntervalDtype) should change, which is currently the same as IntervalDtype.name, so I'd still have that return strings specifying the subtype.


4. CategoricalDtype gets cached incorrectly: (No longer an issue due to #19022)

In [12]: idt1 = IntervalDtype(CategoricalDtype(list('abc'), True))

In [13]: idt2 = IntervalDtype(CategoricalDtype(list('wxyz'), False))

In [14]: idt2.subtype
Out[14]: CategoricalDtype(categories=['a', 'b', 'c'], ordered=True)

This looks to be caused by the caching being done by string representation, and str(CDT(*, *)) always returns 'category':

try:
return cls._cache[str(subtype)]
except KeyError:
u = object.__new__(cls)
u.subtype = subtype
cls._cache[str(subtype)] = u
return u

Can caching be removed entirely for IntervalDtype, or is there some need/advantage that I'm not seeing? Looking at the other dtypes, CategoricalDtype appears to have had the caching code removed, but PeriodDtype and DatetimeTZDtype are using it.

@jreback
Copy link
Contributor

jreback commented Dec 29, 2017

IIRC the reason for caching was:

a) perf (constructing a dtype each time involves some checks)
b) equality/hashing; the caching gives exactly the same object back so we can do is comparison. this is quite important as these are singular objects (though they are parameterized). IOW, IntervalDtype() should always be a singleton, and so on.

@jreback jreback added Categorical Categorical Data Type Dtype Conversions Unexpected or buggy dtype conversions Interval Interval data type Compat pandas objects compatability with Numpy or Python functions Difficulty Intermediate labels Dec 29, 2017
@jreback jreback added this to the Next Major Release milestone Dec 29, 2017
@jreback jreback modified the milestones: Next Major Release, 0.23.0 Jan 7, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Compat pandas objects compatability with Numpy or Python functions Dtype Conversions Unexpected or buggy dtype conversions Interval Interval data type
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants