BUG/API: Unordered Categorical should ignore order in comparisons? #16014

TomAugspurger · 2017-04-16T03:04:45Z

I think that when comparing two unordered categorical-dtyped series with categories which differ only by ordered should compare equal

Code Sample, a copy-pastable example if possible

In [1]: import pandas as pd
c
In [2]: c1 = pd.Series(pd.Categorical(['a', 'b']))

In [3]: c2 = pd.Series(pd.Categorical(['a', 'b'], categories=['b', 'a']))

In [4]: c1
Out[4]:
0    a
1    b
dtype: category
Categories (2, object): [a, b]

In [5]: c2
Out[5]:
0    a
1    b
dtype: category
Categories (2, object): [b, a]

In [6]: c1 == c2

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-6-d8d43a43a02a> in <module>()
----> 1 c1 == c2

/Users/taugspurger/Envs/statsmodels-dev/lib/python3.6/site-packages/pandas/core/ops.py in wrapper(self, other, axis)
    811                 msg = 'Can only compare identically-labeled Series objects'
    812                 raise ValueError(msg)
--> 813             return self._constructor(na_op(self.values, other.values),
    814                                      index=self.index, name=name)
    815         elif isinstance(other, pd.DataFrame):  # pragma: no cover

/Users/taugspurger/Envs/statsmodels-dev/lib/python3.6/site-packages/pandas/core/ops.py in na_op(x, y)
    752         # in either operand
    753         if is_categorical_dtype(x):
--> 754             return op(x, y)
    755         elif is_categorical_dtype(y) and not isscalar(y):
    756             return op(y, x)

/Users/taugspurger/Envs/statsmodels-dev/lib/python3.6/site-packages/pandas/core/categorical.py in f(self, other)
     55             if ((len(self.categories) != len(other.categories)) or
     56                     not ((self.categories == other.categories).all())):
---> 57                 raise TypeError("Categoricals can only be compared if "
     58                                 "'categories' are the same")
     59             if not (self.ordered == other.ordered):

TypeError: Categoricals can only be compared if 'categories' are the same

Expected Output

I think this should return True. Unordered categories shouldn't care about the order :)

cc @JanSchulz thoughts?

jreback · 2017-04-16T18:53:14Z

yes this seems reasonable. creating a Categorical w/o cateogories creates categories in the order they are found, but that's an implementation detail.

if you change this how much breakage on tests? (IOW this is going to affect is_dtype_equal as well)

jankatins · 2017-04-16T20:49:22Z

Once upon a time, categoricals were modeled after factor in R and there, levels (=categories) were sorted:

> levels(factor(c('b','a', 'c')) )
[1] "a" "b" "c"

In that case, both cases would result in the same categories.

I think the sorting was removed ("categories in order of apearance") and this was overlooked? [edit] None of the commits show this (as far as my github blame skill reach...), so it either happend in one of the early discussions or in my imagination and this was simpy overlooked in the implementation.[/]

jorisvandenbossche · 2017-04-16T22:04:25Z

creating a Categorical w/o cateogories creates categories in the order they are found, but that's an implementation detail.

@jreback no, when no categories are specified, the found values are sorted:

In [1]: pd.Categorical(['b', 'a'])
Out[1]:
[b, a]
Categories (2, object): [a, b]

None of the commits show this (as far as my github blame skill reach...), so it either happend in one of the early discussions or in my imagination and this was simpy overlooked in the implementation

It just didn't happen AFAIK :-)

jreback · 2017-04-20T11:11:39Z

right, forgot we sort :>

so the question is why are we differentiating between passed categories and categories created from direct factorizing in the ordered=False case?

For ordered=False

I think leaving the passed categories as is is fine (e.g. no sorting)
a bit indifferent to whether we should sort factorized categories (a) (I know R does), or show them in order of appearance (b)

But these all should be equivalent
i

TomAugspurger · 2017-04-20T11:25:58Z

I'll play around with it when I'm doing the CategoricalDtype stuff. For ease of implementation, we'll have to do something. Having an "ordering" function that deterministically orders a given set of categories is necessary so that the codes match up.

In [8]: pd.Categorical(['b', 'a', 1])
Out[8]:
[b, a, 1]
Categories (3, object): [1, a, b]

In [9]: pd.Categorical(['b', 'a', 1, pd.Timestamp('2017')])
Out[9]:
[b, a, 1, 2017-01-01 00:00:00]
Categories (4, object): [b, a, 1, 2017-01-01 00:00:00]

jorisvandenbossche · 2017-04-20T11:36:42Z

If we are comparing to R, they do not care about order of categories when they are not ordered:

> f1 <- factor(c('b','a', 'c'))
> f2 <- factor(c('b','a', 'c'), levels = c('b', 'a', 'c'))
> f1
[1] b a c
Levels: a b c
> f2
[1] b a c
Levels: b a c
> f1 == f2
[1] TRUE TRUE TRUE

but ... they also don't care when they are ordered:

> f1 <- factor(c('b','a', 'c'), ordered = T)
> f1
[1] b a c
Levels: a < b < c
> f2 <- factor(c('b','a', 'c'), levels = c('b', 'a', 'c'), ordered = T)
> f2
[1] b a c
Levels: b < a < c
> f1 == f2
[1] TRUE TRUE TRUE

If we change something, I would personally keep it so that this second case (ordered categories with different order) raises when trying to compare.
For the unordered case, I am fine with changing this.

a bit indifferent to whether we should sort factorized categories (a) (I know R does), or show them in order of appearance (b)

I certainly think we should not change the current behaviour, which is have sorted categories when you do not specify them manually.

so the question is why are we differentiating between passed categories and categories created from direct factorizing in the ordered=False case?

Because it is a different case. In case of passed categories, you explicitly pass them, so I find it logical that pandas respects the passed values by the user.

BTW, there is no difference between the ordered is False vs True here. In both cases, if you do not specify categories yourself, they are derived from the unique values and sorted:

In [3]: pd.Categorical(['b', 'a'])
Out[3]: 
[b, a]
Categories (2, object): [a, b]

In [4]: pd.Categorical(['b', 'a'], ordered=True)
Out[4]: 
[b, a]
Categories (2, object): [a < b]

jreback · 2017-04-20T11:46:43Z

Because it is a different case. In case of passed categories, you explicitly pass them, so I find it logical that pandas respects the passed values by the user.

this is all fine, but then they should compare equally. The point of ``ordered=False` is the don't care about the ordering (so its a set rather than a list comparison).

TomAugspurger · 2017-04-20T11:47:08Z

If we change something, I would personally keep it so that this second case (ordered categories with different order) raises when trying to compare.
For the unordered case, I am fine with changing this.

Agreed

they are derived from the unique values and sorted:

Well... usually :)

In [1]: import pandas as pd

In [2]: pd.Categorical(['b', 'a', 1])
Out[2]:
[b, a, 1]
Categories (3, object): [1, a, b]

In [3]: pd.Categorical(['b', 'a', 1, pd.Timestamp('2017')])
Out[3]:
[b, a, 1, 2017-01-01 00:00:00]
Categories (4, object): [b, a, 1, 2017-01-01 00:00:00]

But I agree that the current behavior around building categorical shouldn't change, just the behavior when comparing them. I think we're all in agreement on that.

jorisvandenbossche · 2017-04-20T12:06:39Z

But I agree that the current behavior around building categorical shouldn't change, just the behavior when comparing them. I think we're all in agreement on that.

Yep!

Your corner case is a bit strange in that in the first case it seems to be sorted and in the second not, while in both cases the values are not sortable (eg if you would put them in an index or series and sort, you get an error).

Fixes categorical comparison operations improperly considering ordering when two unordered categoricals are compared. Closes pandas-dev#16014

Fixes categorical comparison operations improperly considering ordering when two unordered categoricals are compared. Closes #16014

Fixes categorical comparison operations improperly considering ordering when two unordered categoricals are compared. Closes pandas-dev#16014

Fixes categorical comparison operations improperly considering ordering when two unordered categoricals are compared. Closes pandas-dev#16014 (cherry picked from commit 91e9e52)

Fixes categorical comparison operations improperly considering ordering when two unordered categoricals are compared. Closes #16014 (cherry picked from commit 91e9e52)

Fixes categorical comparison operations improperly considering ordering when two unordered categoricals are compared. Closes pandas-dev#16014

TomAugspurger added Categorical Categorical Data Type Dtype Conversions Unexpected or buggy dtype conversions labels Apr 16, 2017

TomAugspurger added this to the Next Major Release milestone Apr 16, 2017

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue May 12, 2017

BUG: Categorical comparison with unordered

36da210

Fixes categorical comparison operations improperly considering ordering when two unordered categoricals are compared. Closes pandas-dev#16014

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue May 12, 2017

BUG: Categorical comparison with unordered

359a419

Fixes categorical comparison operations improperly considering ordering when two unordered categoricals are compared. Closes pandas-dev#16014

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue May 12, 2017

BUG: Categorical comparison with unordered

2d47107

Fixes categorical comparison operations improperly considering ordering when two unordered categoricals are compared. Closes pandas-dev#16014

TomAugspurger mentioned this issue May 12, 2017

BUG: Categorical comparison with unordered #16339

Merged

jreback modified the milestones: 0.20.2, Next Major Release May 12, 2017

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue May 15, 2017

BUG: Categorical comparison with unordered

9910de4

Fixes categorical comparison operations improperly considering ordering when two unordered categoricals are compared. Closes pandas-dev#16014

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue May 18, 2017

BUG: Categorical comparison with unordered

4ec26d4

Fixes categorical comparison operations improperly considering ordering when two unordered categoricals are compared. Closes pandas-dev#16014

TomAugspurger closed this as completed in #16339 May 18, 2017

TomAugspurger added a commit that referenced this issue May 18, 2017

BUG: Categorical comparison with unordered (#16339)

91e9e52

Fixes categorical comparison operations improperly considering ordering when two unordered categoricals are compared. Closes #16014

pcluo pushed a commit to pcluo/pandas that referenced this issue May 22, 2017

BUG: Categorical comparison with unordered (pandas-dev#16339)

a92566b

Fixes categorical comparison operations improperly considering ordering when two unordered categoricals are compared. Closes pandas-dev#16014

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG/API: Unordered Categorical should ignore order in comparisons? #16014

BUG/API: Unordered Categorical should ignore order in comparisons? #16014

TomAugspurger commented Apr 16, 2017

jreback commented Apr 16, 2017

jankatins commented Apr 16, 2017 •

edited

Loading

jorisvandenbossche commented Apr 16, 2017 •

edited

Loading

jreback commented Apr 20, 2017

TomAugspurger commented Apr 20, 2017

jorisvandenbossche commented Apr 20, 2017 •

edited

Loading

jreback commented Apr 20, 2017

TomAugspurger commented Apr 20, 2017

jorisvandenbossche commented Apr 20, 2017

BUG/API: Unordered Categorical should ignore order in comparisons? #16014

BUG/API: Unordered Categorical should ignore order in comparisons? #16014

Comments

TomAugspurger commented Apr 16, 2017

Code Sample, a copy-pastable example if possible

Expected Output

jreback commented Apr 16, 2017

jankatins commented Apr 16, 2017 • edited Loading

jorisvandenbossche commented Apr 16, 2017 • edited Loading

jreback commented Apr 20, 2017

TomAugspurger commented Apr 20, 2017

jorisvandenbossche commented Apr 20, 2017 • edited Loading

jreback commented Apr 20, 2017

TomAugspurger commented Apr 20, 2017

jorisvandenbossche commented Apr 20, 2017

jankatins commented Apr 16, 2017 •

edited

Loading

jorisvandenbossche commented Apr 16, 2017 •

edited

Loading

jorisvandenbossche commented Apr 20, 2017 •

edited

Loading