PERF: Fixed cut regression, improve Categorical #34952

TomAugspurger · 2020-06-23T14:43:50Z

This has two changes to address a performance regression in
cut / qcut.

~~1. Avoid an unnecessary set conversion in cut.~~ (reverted for now)
2. Avoid a costly conversion to object in the Categorical constructor
for dtypes we don't have a hashtable for.

In [2]: idx = pd.interval_range(0, 1, periods=10000)
In [3]: %timeit pd.Categorical(idx, idx)

# 1.0.4
10.4 ms ± 351 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# master
256 ms ± 5.85 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# HEAD
53.2 µs ± 1.26 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
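
To illustrate the idea behind the second change, here is a minimal sketch using only public pandas API. It assumes that looking values up through the categories' own index (`Index.get_indexer`) produces the same codes the `Categorical` constructor computes; it is not the internal code path touched by this PR, just a way to see why no round-trip through object dtype is needed.

```python
import numpy as np
import pandas as pd

idx = pd.interval_range(0, 1, periods=10000)

# Codes as produced by the Categorical constructor.
cat_codes = pd.Categorical(idx, categories=idx).codes

# The same positions obtained from the categories' own indexing engine,
# without casting the intervals to object dtype first. Values missing from
# the categories would come back as -1, matching the Categorical convention.
direct_codes = idx.get_indexer(idx)

# Integer dtypes may differ, so compare values only.
assert np.array_equal(cat_codes, direct_codes)
```

The comparison uses `np.array_equal` because the constructor downcasts codes to the smallest integer dtype that fits, while `get_indexer` returns platform-sized integers.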

And for the qcut ASV

# 1.0.4
58.5 ms ± 3.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# master
134 ms ± 9.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# HEAD
53.6 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Closes #33921

This has two changes to address a performance regression in cut / qcut.

1. Avoid an unnecessary `set` conversion in cut.
2. Avoid a costly conversion to object in the Categorical constructor for dtypes we don't have a hashtable for.

```python
In [2]: idx = pd.interval_range(0, 1, periods=10000)
In [3]: %timeit pd.Categorical(idx, idx)
```

```
10.4 ms ± 351 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
256 ms ± 5.85 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
53.2 µs ± 1.26 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```

And for the qcut ASV

```
58.5 ms ± 3.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
134 ms ± 9.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
29.8 ms ± 1.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

Closes pandas-dev#33921

TomAugspurger · 2020-06-23T14:47:04Z

pandas/core/reshape/tile.py

@@ -425,6 +426,7 @@ def _bins_to_cuts(
            labels = _format_labels(
                bins, precision, right=right, include_lowest=include_lowest, dtype=dtype
            )
+            labels_are_unique = True


Looking again, it may not be safe to assume that _format_labels returns a unique IntervalIndex like I thought. Might revert this for now.

jreback · 2020-06-23T16:01:27Z

pandas/core/arrays/categorical.py

@@ -2611,6 +2611,11 @@ def _get_codes_for_values(values, categories):
        values = ensure_object(values)
        categories = ensure_object(categories)

+    if isinstance(categories, ABCIndexClass):


hmm shouldn't this be in the caller to this routine?

I think it's best here. We call it in ~3 places, and I think each would have to repeat this check & the object casting stuff.

jreback · 2020-06-23T16:01:54Z

pandas/core/reshape/tile.py

-                categories=labels if len(set(labels)) == len(labels) else None,
-                ordered=ordered,
-            )
+            if labels_are_unique:


why this extra logic?

Sorry... I thought I reverted this. Forgot to push.

But, the set(labels) stuff is somewhat expensive (don't have the exact numbers right now). There may be cases where we know that we have unique labels and can avoid the check.
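
For readers following along, a small sketch of the uniqueness check under discussion. The helper name `labels_are_unique` is hypothetical and only mirrors the `len(set(labels)) == len(labels)` expression from the removed diff lines; the comparison with `Index.is_unique` is an illustration, not something this PR changes.

```python
import pandas as pd


def labels_are_unique(labels) -> bool:
    # Hypothetical helper: the set-based check from the removed diff lines.
    # It materializes every label as a Python object in order to hash it.
    return len(set(labels)) == len(labels)


unique_labels = pd.interval_range(0, 10, periods=5)
duplicate_labels = pd.IntervalIndex.from_tuples([(0, 1), (0, 1), (1, 2)])

# For these inputs the set-based check agrees with the index's own property.
assert labels_are_unique(unique_labels) == unique_labels.is_unique
assert labels_are_unique(duplicate_labels) == duplicate_labels.is_unique
```

When the labels are known to be unique by construction, a check like this can be skipped entirely, which is the case the comment above is describing.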

jreback · 2020-06-24T22:37:04Z

thanks @TomAugspurger

revert tile changes

429c30d

TomAugspurger added the Performance label Jun 23, 2020

TomAugspurger added this to the 1.1 milestone Jun 23, 2020

TomAugspurger added the Categorical, cut, and Regression labels Jun 23, 2020

jreback requested changes Jun 23, 2020

TomAugspurger added 2 commits June 24, 2020 12:25

handle overlapping

4fcacbc

fixup

e32f2b3

jreback approved these changes Jun 24, 2020

jreback merged commit 314ac9a into pandas-dev:master Jun 24, 2020

TomAugspurger deleted the factorize-perf branch June 25, 2020 11:26

fangchenli pushed a commit to fangchenli/pandas that referenced this pull request Jun 27, 2020

PERF: Fixed cut regression, improve Categorical (pandas-dev#34952)

817799c
