Skip to content

CategoricalIndex.equals incorrectly considers category order when unordered #16603

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
TomAugspurger opened this issue Jun 5, 2017 · 9 comments · Fixed by #18822
Closed

CategoricalIndex.equals incorrectly considers category order when unordered #16603

TomAugspurger opened this issue Jun 5, 2017 · 9 comments · Fixed by #18822
Labels
API Design Categorical Categorical Data Type
Milestone

Comments

@TomAugspurger
Copy link
Contributor

>>> import pandas as pd
>>> pd.CategoricalIndex(['a', 'b'], categories=['a', 'b']).equals(
...     pd.CategoricalIndex(['a', 'b'], categories=['b', 'a']))
False

I fixed this for regular Categoricals in #16339, but that didn't affect CategoricalIndex.equals. Do we want .equals to ignore order when ordered=False on both (I think we do)?

Discovered during my CategoricalDtype refactor, which does fix it so that order is ignored when order=False.

@TomAugspurger TomAugspurger added API Design Categorical Categorical Data Type labels Jun 5, 2017
@TomAugspurger TomAugspurger added this to the Next Major Release milestone Jun 5, 2017
@jreback
Copy link
Contributor

jreback commented Jun 5, 2017

yes that sounds right

@naure
Copy link

naure commented Dec 4, 2017

Joining is broken because of this. This randomly broke a real world script depending on input data. Real headache to troubleshoot!

Test case:

# Same categories but in different order
Xi = pd.Categorical(["A"], categories=["A", "B"])
Yi = pd.Categorical(["B"], categories=["B", "A"])

x = pd.DataFrame(1, columns=["x"], index=Xi)
assert "B" not in x.index
y = pd.DataFrame(2, columns=["y"], index=Yi)
assert "B" in y.index

xy = x.join(y, how="inner")

print(x)
print(y)
print("Incorrect join:")
print(xy)

# It used the codes instead of the labels
assert x.index.codes == y.index.codes == xy.index.codes == [0]

assert len(xy) == 0, "Should be empty"  # Broken
assert "B" not in xy.index              # Broken
assert "B" not in xy                    # Actually passes somehow

Output:

   x
A  1

   y
B  2

Incorrect join:
   x  y
B  1  2

---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-690-c66e71d6d8ba> in <module>()
     18 assert x.index.codes == y.index.codes == xy.index.codes == [0]
     19 
---> 20 assert len(xy) == 0, "Should be empty"  # Broken
     21 assert "B" not in xy.index              # Broken
     22 assert "B" not in xy                    # Actually passes somehow

AssertionError: Should be empty

pd.show_versions()

INSTALLED VERSIONS ------------------ commit: None python: 3.5.3.final.0 python-bits: 64 OS: Linux OS-release: 4.10.0-33-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: en_US.UTF-8 LANG: en_US.UTF-8 LOCALE: en_US.UTF-8

pandas: 0.21.0
pytest: None
pip: 9.0.1
setuptools: 36.0.1
Cython: None
numpy: 1.13.3
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.1.0
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.3
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.999999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Dec 4, 2017 via email

@naure
Copy link

naure commented Dec 4, 2017

I can have a look, where to start?

@TomAugspurger
Copy link
Contributor Author

Thanks. CategoricalIndex is defined at https://github.com/pandas-dev/pandas/blob/master/pandas/core/indexes/category.py. It inherits the union / intersection methods from Index, defined in indexes/base.py.

I apparently started a branch for this, master...TomAugspurger:categorical-index-set-ops. I can't recall what was broken, but I suspect it was the set-ops issues with duplicates in #13432

The basic idea in that branch was to override the set ops methods for CategoricalIndex with versions that do the regular set-ops on the codes. And then we construct a new CategoricalIndex based on the codes. Feel free to pick that up, or start fresh and ping if you have issues.

@naure
Copy link

naure commented Dec 5, 2017

No I think this is actually a regression in equality testing of Categorical, independently of the effort you mentioned. Commit. It's not clear whether it should ultimately be equal or not, but if it does, that breaks a number of operations.

Here is a monkey-patch for anyone who needs it safely working regardless of version.
This restores the behaviour of 0.20.1 regarding equality of Categorical and CategoricalIndex.
concat() & co will return regular Index and Series if categories are different in any way.

import pandas as pd

def Categorical_is_dtype_equal(self, other):
    """
    Returns True if categoricals are the same dtype
      same categories, and same ordered

    Parameters
    ----------
    other : Categorical

    Returns
    -------
    are_equal : boolean
    """

    try:
        return (self.categories.equals(other.categories) and
                self.ordered == other.ordered)
    except (AttributeError, TypeError):
        return False

pd.core.categorical.Categorical.is_dtype_equal = Categorical_is_dtype_equal

Test:

# Same categories but in different order
Xi = pd.Categorical(["A"], categories=["A", "B"])
Yi = pd.Categorical(["B"], categories=["B", "A"])
assert not Xi.equals(Yi)

x = pd.DataFrame(1, columns=["x"], index=Xi)
y = pd.DataFrame(2, columns=["y"], index=Yi)
assert not x.index.equals(y.index)

xy = x.join(y, how="inner")
assert len(xy) == 0, "Inner join should be empty"

@TomAugspurger Would like this as a PR?

@TomAugspurger
Copy link
Contributor Author

No I think this is actually a regression in equality testing of Categorical, independently of the effort you mentioned.

Why do you say that?

It's not clear whether it should ultimately be equal or not, but if it does, that breaks a number of operations.

What is "it" in this case that's being compared?

That equality you linked to is on the CategoricalDtype, whose equality semantics are a bit strange especially around empty Categoricals.

@naure
Copy link

naure commented Dec 5, 2017

Because it used to work in 0.20.1. The root cause is the definition of equality that changed: Xi.equals(Yi).

Starting with 0.21.0, it returns True, which leads other operations to manipulate raw codes. This gives meaningless results as the categories are different.

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Dec 5, 2017

Apologies for my slowness, I was looking at CategoricalIndex.equals, not Categorical.equals.

So, as you say, Categorical.equals should be defined as:

  1. The dtypes are equal (categories match and ordering of categories is unimportant when both are unorderd)
  2. The values are equal

The issue with the current definition is that it assumes dtype equality implies that mapping between values and codes match, which isn't always true. So I'd propose something like:

if self.is_dtype_equal(other):
    if self.categories.equals(other.categories):
        # the fast case for codes aligning
        np.array_equal(self._codes, other._codes)
    else:
        # unorded categories with different order
        codes2 = _recode_for_categories(Yi.codes, Yi.categories, Xi.categories)
        return np.array_equal(self._codes, codes2)
return False

then we should be OK. I'll submit that as a PR later today, unless you beat me to it :)

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Dec 18, 2017
The original issue was already fixed. I added tests to verify (but no whatsnew entry).

This addes tests and a fix for pandas-dev#16603 (comment) about `Categorical.equals`

Closes pandas-dev#16603
@jreback jreback modified the milestones: Next Major Release, 0.22.0 Dec 19, 2017
TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Dec 28, 2017
The original issue was already fixed. I added tests to verify (but no whatsnew entry).

This addes tests and a fix for pandas-dev#16603 (comment) about `Categorical.equals`

Closes pandas-dev#16603
TomAugspurger added a commit that referenced this issue Jan 6, 2018
* BUG: Fixed Categorical.Equals with unordered

The original issue was already fixed. I added tests to verify (but no whatsnew entry).

This addes tests and a fix for #16603 (comment) about `Categorical.equals`

Closes #16603

* simplify

* Release note
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Categorical Categorical Data Type
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants