-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
CategoricalIndex.equals incorrectly considers category order when unordered #16603
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
yes that sounds right |
Joining is broken because of this. This randomly broke a real world script depending on input data. Real headache to troubleshoot! Test case:
Output:
pd.show_versions()
INSTALLED VERSIONS
------------------
commit: None
python: 3.5.3.final.0
python-bits: 64
OS: Linux
OS-release: 4.10.0-33-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.21.0 |
@naure do you have time to submit a PR fixing this? I haven't had a chance,
and won't for a little while yet.
…On Mon, Dec 4, 2017 at 8:46 AM, naure ***@***.***> wrote:
Joining is broken because of this. This randomly broke a real world script
depending on input data. Real headache to troubleshoot!
Test case:
# Same categories but in different order
Xi = pd.Categorical(["A"], categories=["A", "B"])
Yi = pd.Categorical(["B"], categories=["B", "A"])
x = pd.DataFrame(1, columns=["x"], index=Xi)
assert "B" not in x.index
y = pd.DataFrame(2, columns=["y"], index=Yi)
assert "B" in y.index
xy = x.join(y, how="inner")
print(x)
print(y)
print("Incorrect join:")
print(xy)
# It used the codes instead of the labels
assert x.index.codes == y.index.codes == xy.index.codes == [0]
assert len(xy) == 0, "Should be empty" # Broken
assert "B" not in xy.index # Broken
assert "B" not in xy # Actually passes somehow
Output:
x
A 1
y
B 2
Incorrect join:
x y
B 1 2
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-690-c66e71d6d8ba> in <module>()
18 assert x.index.codes == y.index.codes == xy.index.codes == [0]
19
---> 20 assert len(xy) == 0, "Should be empty" # Broken
21 assert "B" not in xy.index # Broken
22 assert "B" not in xy # Actually passes somehow
AssertionError: Should be empty
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#16603 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/ABQHInR9ZCKhWKkRPYWU7TDF3Ofnd22bks5s9AWzgaJpZM4NwHJG>
.
|
I can have a look, where to start? |
Thanks. CategoricalIndex is defined at https://github.com/pandas-dev/pandas/blob/master/pandas/core/indexes/category.py. It inherits the I apparently started a branch for this, master...TomAugspurger:categorical-index-set-ops. I can't recall what was broken, but I suspect it was the set-ops issues with duplicates in #13432 The basic idea in that branch was to override the set ops methods for |
No I think this is actually a regression in equality testing of Categorical, independently of the effort you mentioned. Commit. It's not clear whether it should ultimately be equal or not, but if it does, that breaks a number of operations. Here is a monkey-patch for anyone who needs it safely working regardless of version.
Test:
@TomAugspurger Would like this as a PR? |
Why do you say that?
What is "it" in this case that's being compared? That equality you linked to is on the CategoricalDtype, whose equality semantics are a bit strange especially around empty Categoricals. |
Because it used to work in 0.20.1. The root cause is the definition of equality that changed: Xi.equals(Yi). Starting with 0.21.0, it returns True, which leads other operations to manipulate raw codes. This gives meaningless results as the categories are different. |
Apologies for my slowness, I was looking at So, as you say, Categorical.equals should be defined as:
The issue with the current definition is that it assumes dtype equality implies that mapping between values and codes match, which isn't always true. So I'd propose something like: if self.is_dtype_equal(other):
if self.categories.equals(other.categories):
# the fast case for codes aligning
np.array_equal(self._codes, other._codes)
else:
# unorded categories with different order
codes2 = _recode_for_categories(Yi.codes, Yi.categories, Xi.categories)
return np.array_equal(self._codes, codes2)
return False then we should be OK. I'll submit that as a PR later today, unless you beat me to it :) |
The original issue was already fixed. I added tests to verify (but no whatsnew entry). This addes tests and a fix for pandas-dev#16603 (comment) about `Categorical.equals` Closes pandas-dev#16603
The original issue was already fixed. I added tests to verify (but no whatsnew entry). This addes tests and a fix for pandas-dev#16603 (comment) about `Categorical.equals` Closes pandas-dev#16603
* BUG: Fixed Categorical.Equals with unordered The original issue was already fixed. I added tests to verify (but no whatsnew entry). This addes tests and a fix for #16603 (comment) about `Categorical.equals` Closes #16603 * simplify * Release note
I fixed this for regular Categoricals in #16339, but that didn't affect CategoricalIndex.equals. Do we want
.equals
to ignore order whenordered=False
on both (I think we do)?Discovered during my CategoricalDtype refactor, which does fix it so that order is ignored when
order=False
.The text was updated successfully, but these errors were encountered: