-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
PERF: Fixed cut regression, improve Categorical #34952
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
This has two changes to address a performance regression in cut / qcut. 1. Avoid an unnecessary `set` conversion in cut. 2. Aviod a costly conversion to object in the Categoriacl constructor for dtypes we don't have a hashtable for. ```python In [2]: idx = pd.interval_range(0, 1, periods=10000) In [3]: %timeit pd.Categorical(idx, idx) ``` ``` 10.4 ms ± 351 µs per loop (mean ± std. dev. of 7 runs, 100 loops each) 256 ms ± 5.85 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) 53.2 µs ± 1.26 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) ``` And for the qcut ASV ``` 58.5 ms ± 3.13 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) 134 ms ± 9.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) 29.8 ms ± 1.24 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) ``` Closes pandas-dev#33921
pandas/core/reshape/tile.py
Outdated
@@ -425,6 +426,7 @@ def _bins_to_cuts( | |||
labels = _format_labels( | |||
bins, precision, right=right, include_lowest=include_lowest, dtype=dtype | |||
) | |||
labels_are_unique = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking again, it may not be safe to assume that _format_labels
returns a unique IntervalIndex like I thought. Might revert this for now.
@@ -2611,6 +2611,11 @@ def _get_codes_for_values(values, categories): | |||
values = ensure_object(values) | |||
categories = ensure_object(categories) | |||
|
|||
if isinstance(categories, ABCIndexClass): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm shouldn't this be in the caller to this routine?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's best here. We call it in ~3 places, and I think each would have to repeat this check & the object casting stuff.
pandas/core/reshape/tile.py
Outdated
categories=labels if len(set(labels)) == len(labels) else None, | ||
ordered=ordered, | ||
) | ||
if labels_are_unique: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why this extra logic?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry... I thought I reverted this. Forgot to push.
But, the set(labels)
stuff is somewhat expensive (don't have the exact numbers right now). There may be cases where we know that we have unique labels and can avoid the check.
thanks @TomAugspurger |
This has two changes to address a performance regression in
cut / qcut.
1. Avoid an unnecessary(reverted for now)set
conversion in cut.2. Aviod a costly conversion to object in the Categoriacl constructor
for dtypes we don't have a hashtable for.
And for the qcut ASV
Closes #33921