API: Set ops for CategoricalIndex #10186
Comments
So I would operate on both values/categories:

One option to consider is to (for now) only allow these operations with indexes with the same categories.

I think you simply call

IMO (and I think not all agree here), a Therefore a So:
>>> pd.Index([1, 2, 3, 1, 2, 3]).intersection(pd.Index(["2", "3", "4", "2", "3", "4"]))
Index([], dtype='object')
>>> pd.CategoricalIndex([1, 2, 3, 1, 2, 3]).intersection(pd.CategoricalIndex([2, 3, 4, 2, 3, 4]))
# because the underlying categoricals have different categories [1,2,3] and [2,3,4]
Index([], dtype='object')
See also this:
>>> pd.Categorical([1,2,3], ordered=True) > pd.Categorical([2,3,4], ordered=True)
TypeError: Categoricals can only be compared if 'categories' are the same
>>> 1 > "2" # on py3, py2 is ...
TypeError: unorderable types: int() > str()
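
To make the comparison point above concrete, here is a minimal sketch. The shared category list `shared` and the `*_aligned` names are assumptions added for illustration; they are not from the comment itself. It shows that the comparison fails while the categories differ and becomes well defined once both sides use the same ordered categories.

```python
import pandas as pd

a = pd.Categorical([1, 2, 3], ordered=True)
b = pd.Categorical([2, 3, 4], ordered=True)

# Comparing ordered categoricals whose categories differ raises TypeError.
try:
    a > b
    raised = False
except TypeError:
    raised = True

# Assumption for illustration: align both sides onto one shared, ordered
# category list before comparing.
shared = [1, 2, 3, 4]
a_aligned = a.set_categories(shared, ordered=True)
b_aligned = b.set_categories(shared, ordered=True)

# With identical categories the element-wise comparison is defined.
result = a_aligned > b_aligned
```

Whether set operations should align categories implicitly like this, or refuse to, is exactly the open question in this thread.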

For reference, R doesn't care about the order of categories and removes duplicated categories.
intersect(as.factor(c(1, 2, 3)), as.factor(c(2, 3, 4)))
# [1] "2" "3"
intersect(as.factor(c(1, 2, 3, 1, 2, 3)), as.factor(c(2, 3, 4, 2, 3, 4)))
# [1] "2" "3"
intersect(c(1, 2, 3, 1, 2, 3), c(2, 3, 4, 2, 3, 4))
# [1] 2 3
Let me summarize the current opinions and choices. If I misunderstand, please lmk:

Resurrecting this as part of my CategoricalDtype refactor. The semantics in
Currently
In [22]: a = pd.CategoricalIndex(['a', 'b'], categories=['a', 'b', 'c'], ordered=True)
In [23]: b = pd.CategoricalIndex(['b', 'c'], categories=['a', 'b', 'c'], ordered=True)
In [24]: a.union(b).ordered
Out[24]: False
I think we'll follow those rules on the categories for each of the set operations.
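
As a rough sketch of what keeping the dtype could look like when both sides already share the same categories and orderedness: `union_same_dtype` below is a hypothetical helper, not a pandas API and not necessarily what the refactor implements. It computes the union on plain values and re-wraps the result with the original `CategoricalDtype`, so `ordered` is not lost.

```python
import pandas as pd


def union_same_dtype(left, right):
    """Hypothetical helper: set union that keeps a shared CategoricalDtype."""
    if left.dtype != right.dtype:
        raise TypeError("this sketch only handles identical categorical dtypes")
    # Compute the union on plain values, then re-wrap with the original dtype
    # so that both the categories and the `ordered` flag are preserved.
    values = pd.Index(list(left)).union(pd.Index(list(right)))
    return pd.CategoricalIndex(values, dtype=left.dtype)


a = pd.CategoricalIndex(['a', 'b'], categories=['a', 'b', 'c'], ordered=True)
b = pd.CategoricalIndex(['b', 'c'], categories=['a', 'b', 'c'], ordered=True)
result = union_same_dtype(a, b)
```

Here `result.ordered` stays `True` because the dtype is passed through unchanged.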
Actually, I think we can handle additional cases with
We could even support union over categoricals with "gaps" like
These rules should work for intersect, difference, and symmetric difference too.

You guys probably already know this, but in case not, FYI: this is the current (0.24.2) behavior for union of categorical indices, which makes it difficult to do anything involving two slightly different categorical indices:
>>> pd.CategoricalIndex([1, 2, 4]).union(pd.CategoricalIndex([2, 3, 4]))
CategoricalIndex([1, 2, 4, nan], categories=[1, 2, 4], ordered=False, dtype='category')
not
CategoricalIndex([1, 2, 4, 3], categories=[1, 2, 4, 3], ordered=False, dtype='category')
# or something
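
One possible workaround for the case above is to merge the categories explicitly with `pandas.api.types.union_categoricals` and then drop repeated values. This is a sketch of a workaround, not the behavior of `CategoricalIndex.union` itself; the variable names are illustrative.

```python
import pandas as pd
from pandas.api.types import union_categoricals

left = pd.CategoricalIndex([1, 2, 4])
right = pd.CategoricalIndex([2, 3, 4])

# union_categoricals concatenates the values and merges the categories,
# so no value falls outside the category set and turns into NaN.
combined = union_categoricals([left, right])

# Drop repeated values to get set-like semantics.
result = pd.CategoricalIndex(combined).unique()
```

Note that `union_categoricals` concatenates rather than performing a set union, which is why the explicit `unique()` step is needed.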
Derived from #10157. Would like to clarify what these results should be. Basically, I think:
CategoricalIndex
Index
which has the same original values.category
should only include categories which the result actually has.
The following are the current results.
intersection
union
The doc says "Form the union of two Index objects and sorts if possible". I'm not sure whether the last part means "raise an error if sorting is impossible" or "don't sort if sorting is impossible".
difference
sym_diff
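
For illustration only, here is one way to read the proposal at the top of this issue: the result is a `CategoricalIndex`, its values match what a plain `Index` would give, and its categories contain only values present in the result. `categorical_set_op` is a hypothetical helper, not a pandas function, and this reading of the list is an assumption. The sketch uses `symmetric_difference`, the name pandas uses for `sym_diff` in later versions.

```python
import pandas as pd


def categorical_set_op(left, right, op):
    """Hypothetical helper, not part of pandas."""
    # Run the set operation on plain (non-categorical) indexes.
    plain = getattr(pd.Index(list(left)), op)(pd.Index(list(right)))
    # Re-wrap: categories are inferred from the values actually present.
    return pd.CategoricalIndex(plain)


left = pd.CategoricalIndex([1, 2, 3])
right = pd.CategoricalIndex([2, 3, 4])

results = {
    op: categorical_set_op(left, right, op)
    for op in ("intersection", "union", "difference", "symmetric_difference")
}
```

Because the categories are rebuilt from the result values, any `ordered` flag or custom category order on the inputs is discarded in this sketch.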