Skip to content

PERF: Fixed cut regression, improve Categorical #34952

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 4 commits into from
Jun 24, 2020

Conversation

TomAugspurger
Copy link
Contributor

@TomAugspurger TomAugspurger commented Jun 23, 2020

This has two changes to address a performance regression in
cut / qcut.

1. Avoid an unnecessary set conversion in cut. (reverted for now)
2. Aviod a costly conversion to object in the Categoriacl constructor
for dtypes we don't have a hashtable for.

In [2]: idx = pd.interval_range(0, 1, periods=10000)
In [3]: %timeit pd.Categorical(idx, idx)
# 1.0.4
10.4 ms ± 351 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# master
256 ms ± 5.85 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# HEAD
53.2 µs ± 1.26 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

And for the qcut ASV

# 1.0.4
58.5 ms ± 3.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# master
134 ms ± 9.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

# HEAD
53.6 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Closes #33921

This has two changes to address a performance regression in
cut / qcut.

1. Avoid an unnecessary `set` conversion in cut.
2. Aviod a costly conversion to object in the Categoriacl constructor
   for dtypes we don't have a hashtable for.

```python
In [2]: idx = pd.interval_range(0, 1, periods=10000)
In [3]: %timeit pd.Categorical(idx, idx)
```

```
10.4 ms ± 351 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

256 ms ± 5.85 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

53.2 µs ± 1.26 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```

And for the qcut ASV

```
58.5 ms ± 3.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

134 ms ± 9.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

29.8 ms ± 1.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

Closes pandas-dev#33921
@@ -425,6 +426,7 @@ def _bins_to_cuts(
labels = _format_labels(
bins, precision, right=right, include_lowest=include_lowest, dtype=dtype
)
labels_are_unique = True
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking again, it may not be safe to assume that _format_labels returns a unique IntervalIndex like I thought. Might revert this for now.

@TomAugspurger TomAugspurger added the Performance Memory or execution speed performance label Jun 23, 2020
@TomAugspurger TomAugspurger added this to the 1.1 milestone Jun 23, 2020
@TomAugspurger TomAugspurger added Categorical Categorical Data Type cut cut, qcut Regression Functionality that used to work in a prior pandas version labels Jun 23, 2020
@@ -2611,6 +2611,11 @@ def _get_codes_for_values(values, categories):
values = ensure_object(values)
categories = ensure_object(categories)

if isinstance(categories, ABCIndexClass):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm shouldn't this be in the caller to this routine?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's best here. We call it in ~3 places, and I think each would have to repeat this check & the object casting stuff.

categories=labels if len(set(labels)) == len(labels) else None,
ordered=ordered,
)
if labels_are_unique:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this extra logic?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry... I thought I reverted this. Forgot to push.

But, the set(labels) stuff is somewhat expensive (don't have the exact numbers right now). There may be cases where we know that we have unique labels and can avoid the check.

@jreback jreback merged commit 314ac9a into pandas-dev:master Jun 24, 2020
@jreback
Copy link
Contributor

jreback commented Jun 24, 2020

thanks @TomAugspurger

@TomAugspurger TomAugspurger deleted the factorize-perf branch June 25, 2020 11:26
fangchenli pushed a commit to fangchenli/pandas that referenced this pull request Jun 27, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Categorical Categorical Data Type cut cut, qcut Performance Memory or execution speed performance Regression Functionality that used to work in a prior pandas version
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Performance regression in reshape.Cut.time_qcut_timedelta
2 participants