PERF: faster categorical ops for equal or larger than scalar #29820

topper-123 · 2019-11-23T23:03:34Z

np.nan is always encoded as -1 in a Categorical, so its code is smaller than the code of every actual category (those are always 0 or larger).

So, if we're checking whether a Categorical is equal to, larger than, or equal to or larger than a scalar that is one of the categories, it is not necessary to find the locations of -1, as the comparison on the codes is already False there.
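The reasoning above can be illustrated with a small sketch. It models the integer codes with plain Python lists rather than calling pandas, so `compare_codes`, `NA_CODE`, and the sample codes are hypothetical names and values chosen for illustration, not the actual implementation.

```python
import operator

NA_CODE = -1  # code used for missing values, as described above


def compare_codes(codes, other_code, op):
    """Compare each integer code against the scalar's code."""
    return [op(code, other_code) for code in codes]


# Hypothetical codes for the values [NaN, 'b', 'c', 'a']
# with categories ['a', 'b', 'c'].
codes = [NA_CODE, 1, 2, 0]
b_code = 1  # position of 'b' in the categories

# For ==, >= and > the missing positions are already False,
# because -1 is below every valid code.
eq = compare_codes(codes, b_code, operator.eq)
ge = compare_codes(codes, b_code, operator.ge)
gt = compare_codes(codes, b_code, operator.gt)

# For <= the raw comparison is True at the missing position,
# so an explicit mask is still needed there.
le_raw = compare_codes(codes, b_code, operator.le)
le = [res and code != NA_CODE for res, code in zip(le_raw, codes)]
```

Only `le_raw` differs from its masked counterpart, which is why the mask can be skipped for the three "equal or larger" style operations but not for the "less than" ones.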

Performance

>>> n = 1_000_000
>>> c = pd.Categorical([np.nan] * n + ['b'] * n + ['c'] * n,
                       dtype=pd.CategoricalDtype(['a', 'b', 'c'], ordered=True))
>>> %timeit c == 'b'
9.79 ms ± 32.4 µs per loop  # master
3.94 ms ± 38.7 µs per loop  # this PR
# the improvement isn't available for less-than or less-than-or-equal ops
>>> %timeit c <= 'b'
10.1 ms ± 44.1 µs per loop  # both master and this PR

gfyoung · 2019-11-23T23:55:40Z

doc/source/whatsnew/v1.0.0.rst

- Performance improvement when comparing a :meth:`Categorical` with a scalar and the scalar is not found in the categories (:issue:`29750`)
+- Performance improvement when comparing a :class:`Categorical` with a scalar and the scalar is not found in the categories (:issue:`29750`)
+- Performance improvement when checking if values in a :class:`Categorical` are equal, equal or larger or larger than a given scalar.
+  The improvement is not present if checking if the :class:`Categorical` is less than or less than or equal than the scalar (:issue:`xxxxx`)


Why don't we have an issue number here? Otherwise, we should use the PR number.

Yeah, added.

gfyoung · 2019-11-23T23:56:08Z

pandas/core/arrays/categorical.py

+                if opname not in {"eq", "__eq__", "ge", "__ge__", "gt", "__gt__"}:
+                    # check for NaN needed if we are not equal or larger
+                    mask = self._codes == -1
+                    ret[mask] = False


Do we have a performance test for this?

No, no ASV's for this ATM. I actually can't get ASV to run locally, maybe a Windows issue?

Anyway, I've added an ASV test set, but haven't been able to run it myself, unfortunately. Isn't there a web page where we post ASVs?

jreback · 2019-11-24T00:06:24Z

pandas/core/arrays/categorical.py

-                # check for NaN in self
-                mask = self._codes == -1
-                ret[mask] = False
+                if opname not in {"eq", "__eq__", "ge", "__ge__", "gt", "__gt__"}:


It is very strange that you are checking for both eq and __eq__; it's just __eq__.

Yeah, changed.

topper-123 · 2019-11-24T00:51:56Z

asv_bench/benchmarks/categoricals.py

-    def time_union(self):
-        union_categoricals([self.a, self.b])
-
-


Just moved this down so the constructor checks are first, which seems logical.

jreback · 2019-11-25T22:55:02Z

thanks @topper-123

PERF: faster categorical ops for equal or larger scalar

313f590

topper-123 changed the title ~~PERF: faster categorical ops for equal or larger scalars~~ PERF: faster categorical ops for equal or larger than scalar Nov 23, 2019

gfyoung added the Categorical (Categorical Data Type) and Performance (Memory or execution speed performance) labels Nov 23, 2019

gfyoung reviewed Nov 23, 2019

jreback requested changes Nov 24, 2019

Changes according to comments

2c41268

topper-123 commented Nov 24, 2019

jreback added this to the 1.0 milestone Nov 25, 2019

jreback approved these changes Nov 25, 2019

jreback merged commit de3db0a into pandas-dev:master Nov 25, 2019

topper-123 deleted the categorical_scalar_perf branch November 25, 2019 22:58

proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019

PERF: faster categorical ops for equal or larger than scalar (pandas-…

90ca876

…dev#29820)

proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019

PERF: faster categorical ops for equal or larger than scalar (pandas-…

fc61a17

…dev#29820)
