PERF: CategoricalDtype.__eq__ #36280

jbrockmendel · 2020-09-11T01:52:01Z

Three special cases where we can get better performance

categories have different lengths --> never equal
categories have different dtypes --> never equal
categories are not object-dtype --> can use get_indexer instead of hashing
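For readers skimming the thread, the three cases above can be sketched roughly as follows. This is a minimal illustration, not the merged code: the function name `unordered_categories_match` is hypothetical, the relative order of the two cheap checks is an assumption, and the object-dtype fallback uses a plain set comparison as a stand-in for the hash-based comparison the description refers to.

```python
import pandas as pd


def unordered_categories_match(left: pd.Index, right: pd.Index) -> bool:
    """Hypothetical sketch of the fast paths for unordered categories."""
    # Cheap checks first: mismatched dtype or length can never be equal.
    if left.dtype != right.dtype:
        return False
    if len(left) != len(right):
        return False

    # Identical categories in identical order.
    if left.equals(right):
        return True

    if left.dtype != object:
        # Categories are unique, so with equal lengths an indexer without
        # any -1 entries means every element of `right` appears in `left`.
        indexer = left.get_indexer(right)
        return bool((indexer != -1).all())

    # Stand-in for the hash-based comparison used for object dtype.
    # Unlike the real comparison, a set treats 2 and 2.0 as the same value.
    return set(left) == set(right)


a = pd.Index([1, 2, 3])
print(unordered_categories_match(a, pd.Index([3, 1, 2])))  # True: same values, different order
```

The point of the ordering is that the dtype and length comparisons cost almost nothing, so the common "obviously different" cases return before any per-element work is done.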

import pandas as pd
import numpy as np

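# tz-aware datetimes, so the categories are not object-dtype
dti = pd.date_range("2016-01-01", periods=10000, tz="US/Pacific")
dti2 = type(dti)(dti._data.copy())

# shuffle dti's underlying values in place: same elements, different order
np.random.shuffle(dti._data._data)
assert not dti.equals(dti2)

cd1 = pd.CategoricalDtype(dti)
cd2 = pd.CategoricalDtype(dti2)  # same categories as cd1, different order
cd3 = pd.CategoricalDtype(dti2[:-1])  # different length
cd4 = pd.CategoricalDtype(pd.Index(range(len(dti))))  # different dtype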

In [5]: %timeit cd1 == cd2 
548 µs ± 3.94 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)  # <-- master
239 µs ± 1.04 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)  # <-- PR

In [6]: %timeit cd1 == cd3
543 µs ± 4.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)   # <-- master
5.05 µs ± 257 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) # <-- PR

In [7]: %timeit cd1 == cd4
389 µs ± 13.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)  # <-- master
2.17 µs ± 63.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)  # <-- PR
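As a quick reading aid, the speedups implied by the mean timings above can be computed directly. The numbers are copied from the `%timeit` output; the dictionary and variable names are only for illustration.

```python
# Mean timings in microseconds, copied from the %timeit output: (master, PR)
timings_us = {
    "cd1 == cd2": (548.0, 239.0),  # same categories, different order
    "cd1 == cd3": (543.0, 5.05),   # different lengths
    "cd1 == cd4": (389.0, 2.17),   # different dtypes
}

speedups = {case: master / pr for case, (master, pr) in timings_us.items()}

for case, factor in speedups.items():
    print(f"{case}: {factor:.1f}x faster")
```

This works out to roughly 2.3x for the reordered case and over 100x for the two early-exit cases.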

jreback · 2020-09-12T20:32:43Z

pandas/core/dtypes/dtypes.py

-                self.categories.dtype == other.categories.dtype
-                and self.categories.equals(other.categories)
-            ):
+            left = self.categories


Can you add a comment that the ordering of checks is for perf?

jreback · 2020-09-12T20:32:53Z

pandas/core/dtypes/dtypes.py

-            ):
+            left = self.categories
+            right = other.categories
+            if not left.dtype == right.dtype:


is_dtype_equal

jbrockmendel · 2020-09-12

That'd require a circular import (actually I'd like to put is_dtype_equal "above" this file, but that's for another day).

jbrockmendel added 2 commits September 10, 2020 18:44

PERF: CategoricalDtype.__eq__

2845100

Merge branch 'master' of https://github.com/pandas-dev/pandas into pe…

4680cd4

…rf-cat-cmp

jreback added the Categorical (Categorical Data Type) and Performance (Memory or execution speed performance) labels Sep 12, 2020

jreback added this to the 1.2 milestone Sep 12, 2020

jreback requested changes Sep 12, 2020

jbrockmendel added 2 commits September 12, 2020 14:40

Merge branch 'master' of https://github.com/pandas-dev/pandas into pe…

0ef1393

…rf-cat-cmp

comment

0091f5f

jreback approved these changes Sep 13, 2020

jreback merged commit 1478291 into pandas-dev:master Sep 13, 2020

jbrockmendel deleted the perf-cat-cmp branch September 13, 2020 15:03

kesmit13 pushed a commit to kesmit13/pandas that referenced this pull request Nov 2, 2020

PERF: CategoricalDtype.__eq__ (pandas-dev#36280)

eb38bb7
