PERF: CategoricalDtype.__eq__ #36280

Merged 4 commits on Sep 13, 2020

jbrockmendel
Member

Three special cases where we can get better performance

  1. categories have different lengths --> never equal
  2. categories have different dtypes --> never equal
  3. categories are not object-dtype --> can use get_indexer instead of hashing

import pandas as pd
import numpy as np

dti = pd.date_range("2016-01-01", periods=10000, tz="US/Pacific")
dti2 = type(dti)(dti._data.copy())

np.random.shuffle(dti._data._data)
assert not dti.equals(dti2)

cd1 = pd.CategoricalDtype(dti)
cd2 = pd.CategoricalDtype(dti2)
cd3 = pd.CategoricalDtype(dti2[:-1])
cd4 = pd.CategoricalDtype(pd.Index(range(len(dti))))

In [5]: %timeit cd1 == cd2 
548 µs ± 3.94 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)  # <-- master
239 µs ± 1.04 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)  # <-- PR

In [6]: %timeit cd1 == cd3
543 µs ± 4.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)   # <-- master
5.05 µs ± 257 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) # <-- PR

In [7]: %timeit cd1 == cd4
389 µs ± 13.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)  # <-- master
2.17 µs ± 63.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)  # <-- PR
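
The three special cases listed above can be sketched as a standalone function. This is a minimal illustration under stated assumptions, not the code in this PR: `categories_match_unordered` is a hypothetical helper name, it uses only public `pandas.Index` methods, and the object-dtype fallback uses a plain `set` comparison as a stand-in for the hashing path. It assumes both inputs hold unique values, as categories do.

```python
import pandas as pd


def categories_match_unordered(left: pd.Index, right: pd.Index) -> bool:
    """Hypothetical helper: do two unique Indexes hold the same values,
    ignoring order? Cheap checks come first so mismatches exit early."""
    # Case 1: different lengths can never match (values are unique).
    if len(left) != len(right):
        return False

    # Case 2: different dtypes can never match.
    if left.dtype != right.dtype:
        return False

    # Same values in the same order: nothing more to check.
    if left.equals(right):
        return True

    # Case 3: for non-object dtypes, look up each value of `right` in `left`.
    # get_indexer returns -1 for values that are missing from `left`. With
    # equal lengths and unique values, "no -1" means the values coincide.
    if left.dtype != object:
        indexer = left.get_indexer(right)
        return bool((indexer != -1).all())

    # Object dtype: stand-in for the hashing path (an assumption of this sketch).
    return set(left) == set(right)


a = pd.Index([1, 2, 3])
b = pd.Index([3, 1, 2])
print(categories_match_unordered(a, b))       # same values, different order
print(categories_match_unordered(a, b[:-1]))  # different lengths
```

Putting the length and dtype comparisons first matters because both cost almost nothing next to a full pass over the values, which is consistent with the `cd1 == cd3` and `cd1 == cd4` timings dropping to microseconds.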

@jreback added the Categorical (Categorical Data Type) and Performance (Memory or execution speed performance) labels Sep 12, 2020
@jreback added this to the 1.2 milestone Sep 12, 2020
self.categories.dtype == other.categories.dtype
and self.categories.equals(other.categories)
):
left = self.categories
Contributor

can you add a comment that the ordering of checks is for perf

):
left = self.categories
right = other.categories
if not left.dtype == right.dtype:
Contributor

is_dtype_equal

Member Author

that'd require a circular import (actually I'd like to put is_dtype_equal "above" this file, but that's for another day)
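
For readers unfamiliar with the trade-off: a common Python workaround for a circular import is to defer the import into the function body, at the cost of a lookup on every call. The sketch below only illustrates that pattern with the public `pandas.api.types.is_dtype_equal`; `same_categories_dtype` is a hypothetical name, and the diff under review compares the dtypes with `==` instead.

```python
import pandas as pd


def same_categories_dtype(left: pd.Index, right: pd.Index) -> bool:
    """Hypothetical helper showing a function-level (deferred) import."""
    # Deferred import: resolved when the function runs, not when the
    # enclosing module is first imported, which avoids an import cycle.
    from pandas.api.types import is_dtype_equal

    return is_dtype_equal(left.dtype, right.dtype)


ints = pd.Index([1, 2, 3])
floats = pd.Index([1.0, 2.0, 3.0])
print(same_categories_dtype(ints, ints))
print(same_categories_dtype(ints, floats))
```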

@jreback merged commit 1478291 into pandas-dev:master Sep 13, 2020
@jbrockmendel deleted the perf-cat-cmp branch September 13, 2020 15:03
kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request Nov 2, 2020