Skip to content

BUG/API: Unordered Categorical should ignore order in comparisons? #16014

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
TomAugspurger opened this issue Apr 16, 2017 · 9 comments
Closed
Labels
Categorical Categorical Data Type Dtype Conversions Unexpected or buggy dtype conversions
Milestone

Comments

@TomAugspurger
Copy link
Contributor

I think that when comparing two unordered categorical-dtyped series with categories which differ only by ordered should compare equal

Code Sample, a copy-pastable example if possible

In [1]: import pandas as pd
c
In [2]: c1 = pd.Series(pd.Categorical(['a', 'b']))

In [3]: c2 = pd.Series(pd.Categorical(['a', 'b'], categories=['b', 'a']))

In [4]: c1
Out[4]:
0    a
1    b
dtype: category
Categories (2, object): [a, b]

In [5]: c2
Out[5]:
0    a
1    b
dtype: category
Categories (2, object): [b, a]

In [6]: c1 == c2
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-6-d8d43a43a02a> in <module>()
----> 1 c1 == c2

/Users/taugspurger/Envs/statsmodels-dev/lib/python3.6/site-packages/pandas/core/ops.py in wrapper(self, other, axis)
    811                 msg = 'Can only compare identically-labeled Series objects'
    812                 raise ValueError(msg)
--> 813             return self._constructor(na_op(self.values, other.values),
    814                                      index=self.index, name=name)
    815         elif isinstance(other, pd.DataFrame):  # pragma: no cover

/Users/taugspurger/Envs/statsmodels-dev/lib/python3.6/site-packages/pandas/core/ops.py in na_op(x, y)
    752         # in either operand
    753         if is_categorical_dtype(x):
--> 754             return op(x, y)
    755         elif is_categorical_dtype(y) and not isscalar(y):
    756             return op(y, x)

/Users/taugspurger/Envs/statsmodels-dev/lib/python3.6/site-packages/pandas/core/categorical.py in f(self, other)
     55             if ((len(self.categories) != len(other.categories)) or
     56                     not ((self.categories == other.categories).all())):
---> 57                 raise TypeError("Categoricals can only be compared if "
     58                                 "'categories' are the same")
     59             if not (self.ordered == other.ordered):

TypeError: Categoricals can only be compared if 'categories' are the same

Expected Output

I think this should return True. Unordered categories shouldn't care about the order :)

cc @JanSchulz thoughts?

@TomAugspurger TomAugspurger added Categorical Categorical Data Type Dtype Conversions Unexpected or buggy dtype conversions labels Apr 16, 2017
@TomAugspurger TomAugspurger added this to the Next Major Release milestone Apr 16, 2017
@jreback
Copy link
Contributor

jreback commented Apr 16, 2017

yes this seems reasonable. creating a Categorical w/o cateogories creates categories in the order they are found, but that's an implementation detail.

if you change this how much breakage on tests? (IOW this is going to affect is_dtype_equal as well)

@jankatins
Copy link
Contributor

jankatins commented Apr 16, 2017

Once upon a time, categoricals were modeled after factor in R and there, levels (=categories) were sorted:

> levels(factor(c('b','a', 'c')) )
[1] "a" "b" "c"

In that case, both cases would result in the same categories.

I think the sorting was removed ("categories in order of apearance") and this was overlooked? [edit] None of the commits show this (as far as my github blame skill reach...), so it either happend in one of the early discussions or in my imagination and this was simpy overlooked in the implementation.[/]

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Apr 16, 2017

creating a Categorical w/o cateogories creates categories in the order they are found, but that's an implementation detail.

@jreback no, when no categories are specified, the found values are sorted:

In [1]: pd.Categorical(['b', 'a'])
Out[1]:
[b, a]
Categories (2, object): [a, b]

None of the commits show this (as far as my github blame skill reach...), so it either happend in one of the early discussions or in my imagination and this was simpy overlooked in the implementation

It just didn't happen AFAIK :-)

@jreback
Copy link
Contributor

jreback commented Apr 20, 2017

right, forgot we sort :>

so the question is why are we differentiating between passed categories and categories created from direct factorizing in the ordered=False case?

For ordered=False

  • I think leaving the passed categories as is is fine (e.g. no sorting)
  • a bit indifferent to whether we should sort factorized categories (a) (I know R does), or show them in order of appearance (b)

But these all should be equivalent
i

@TomAugspurger
Copy link
Contributor Author

I'll play around with it when I'm doing the CategoricalDtype stuff. For ease of implementation, we'll have to do something. Having an "ordering" function that deterministically orders a given set of categories is necessary so that the codes match up.

In [8]: pd.Categorical(['b', 'a', 1])
Out[8]:
[b, a, 1]
Categories (3, object): [1, a, b]

In [9]: pd.Categorical(['b', 'a', 1, pd.Timestamp('2017')])
Out[9]:
[b, a, 1, 2017-01-01 00:00:00]
Categories (4, object): [b, a, 1, 2017-01-01 00:00:00]

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Apr 20, 2017

If we are comparing to R, they do not care about order of categories when they are not ordered:

> f1 <- factor(c('b','a', 'c'))
> f2 <- factor(c('b','a', 'c'), levels = c('b', 'a', 'c'))
> f1
[1] b a c
Levels: a b c
> f2
[1] b a c
Levels: b a c
> f1 == f2
[1] TRUE TRUE TRUE

but ... they also don't care when they are ordered:

> f1 <- factor(c('b','a', 'c'), ordered = T)
> f1
[1] b a c
Levels: a < b < c
> f2 <- factor(c('b','a', 'c'), levels = c('b', 'a', 'c'), ordered = T)
> f2
[1] b a c
Levels: b < a < c
> f1 == f2
[1] TRUE TRUE TRUE

If we change something, I would personally keep it so that this second case (ordered categories with different order) raises when trying to compare.
For the unordered case, I am fine with changing this.

a bit indifferent to whether we should sort factorized categories (a) (I know R does), or show them in order of appearance (b)

I certainly think we should not change the current behaviour, which is have sorted categories when you do not specify them manually.

so the question is why are we differentiating between passed categories and categories created from direct factorizing in the ordered=False case?

Because it is a different case. In case of passed categories, you explicitly pass them, so I find it logical that pandas respects the passed values by the user.

BTW, there is no difference between the ordered is False vs True here. In both cases, if you do not specify categories yourself, they are derived from the unique values and sorted:

In [3]: pd.Categorical(['b', 'a'])
Out[3]: 
[b, a]
Categories (2, object): [a, b]

In [4]: pd.Categorical(['b', 'a'], ordered=True)
Out[4]: 
[b, a]
Categories (2, object): [a < b]

@jreback
Copy link
Contributor

jreback commented Apr 20, 2017

Because it is a different case. In case of passed categories, you explicitly pass them, so I find it logical that pandas respects the passed values by the user.

this is all fine, but then they should compare equally. The point of ``ordered=False` is the don't care about the ordering (so its a set rather than a list comparison).

@TomAugspurger
Copy link
Contributor Author

If we change something, I would personally keep it so that this second case (ordered categories with different order) raises when trying to compare.
For the unordered case, I am fine with changing this.

Agreed

they are derived from the unique values and sorted:

Well... usually :)

In [1]: import pandas as pd

In [2]: pd.Categorical(['b', 'a', 1])
Out[2]:
[b, a, 1]
Categories (3, object): [1, a, b]

In [3]: pd.Categorical(['b', 'a', 1, pd.Timestamp('2017')])
Out[3]:
[b, a, 1, 2017-01-01 00:00:00]
Categories (4, object): [b, a, 1, 2017-01-01 00:00:00]

But I agree that the current behavior around building categorical shouldn't change, just the behavior when comparing them. I think we're all in agreement on that.

@jorisvandenbossche
Copy link
Member

But I agree that the current behavior around building categorical shouldn't change, just the behavior when comparing them. I think we're all in agreement on that.

Yep!

Your corner case is a bit strange in that in the first case it seems to be sorted and in the second not, while in both cases the values are not sortable (eg if you would put them in an index or series and sort, you get an error).

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue May 12, 2017
Fixes categorical comparison operations improperly considering
ordering when two unordered categoricals are compared.

Closes pandas-dev#16014
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue May 12, 2017
Fixes categorical comparison operations improperly considering
ordering when two unordered categoricals are compared.

Closes pandas-dev#16014
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue May 12, 2017
Fixes categorical comparison operations improperly considering
ordering when two unordered categoricals are compared.

Closes pandas-dev#16014
@jreback jreback modified the milestones: 0.20.2, Next Major Release May 12, 2017
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue May 15, 2017
Fixes categorical comparison operations improperly considering
ordering when two unordered categoricals are compared.

Closes pandas-dev#16014
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue May 18, 2017
Fixes categorical comparison operations improperly considering
ordering when two unordered categoricals are compared.

Closes pandas-dev#16014
TomAugspurger added a commit that referenced this issue May 18, 2017
Fixes categorical comparison operations improperly considering
ordering when two unordered categoricals are compared.

Closes #16014
pcluo pushed a commit to pcluo/pandas that referenced this issue May 22, 2017
Fixes categorical comparison operations improperly considering
ordering when two unordered categoricals are compared.

Closes pandas-dev#16014
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue May 29, 2017
Fixes categorical comparison operations improperly considering
ordering when two unordered categoricals are compared.

Closes pandas-dev#16014
(cherry picked from commit 91e9e52)
TomAugspurger added a commit that referenced this issue May 30, 2017
Fixes categorical comparison operations improperly considering
ordering when two unordered categoricals are compared.

Closes #16014
(cherry picked from commit 91e9e52)
stangirala pushed a commit to stangirala/pandas that referenced this issue Jun 11, 2017
Fixes categorical comparison operations improperly considering
ordering when two unordered categoricals are compared.

Closes pandas-dev#16014
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type Dtype Conversions Unexpected or buggy dtype conversions
Projects
None yet
Development

No branches or pull requests

4 participants