-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
PERF: avoid unneeded recoding of categoricals and reuse CategoricalDtypes for greater slicing speed #21659
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PERF: avoid unneeded recoding of categoricals and reuse CategoricalDtypes for greater slicing speed #21659
Conversation
@@ -169,7 +169,7 @@ def _create_categorical(cls, data, categories=None, ordered=None, | |||
data = data.set_categories(categories, ordered=ordered) | |||
elif ordered is not None and ordered != data.ordered: | |||
data = data.set_ordered(ordered) | |||
if isinstance(dtype, CategoricalDtype): | |||
if isinstance(dtype, CategoricalDtype) and dtype != data.dtype: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be noted that in master this dtype comparison is quite slow:
>>> d = pd.api.types.CategoricalDtype(categories=['a', 'b', 'c'])
>>> %timeit d == d
149 µs # master
524 ns # this PR
This is the reason for the focus on improving dtype comparisons also in this PR, so this check doesn't cause slowdowns on other parts of pandas. Otherwise the performance benefits of this PR would be ambivalent, causing some slowdowns also).
The CircleCi error seems unrelated : |
Codecov Report
@@ Coverage Diff @@
## master #21659 +/- ##
==========================================
+ Coverage 91.9% 91.9% +<.01%
==========================================
Files 154 154
Lines 49555 49559 +4
==========================================
+ Hits 45542 45546 +4
Misses 4013 4013
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add to the perf issue for 0.24.0 (If we only have them on 0.23.2, then pls add a new one)
pandas/core/dtypes/dtypes.py
Outdated
""" | ||
if isinstance(other, compat.string_types): | ||
return other == self.name | ||
|
||
if other is self: | ||
return True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
prob doesn't make any difference, but this could be elif (existing if)
Comments accounted for. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks reasonable at a quick glance.
thanks @topper-123 |
…ypes for greater slicing speed (pandas-dev#21659)
git diff upstream/master -u -- "*.py" | flake8 --diff
Currently, a lot of indexing/slicing ops on CategoricalIndexes goes through
CategoricalIndex._create_categorical
, which can be a slow operation, because callingdata._set_dtype
there is slow. This PR improves performance by avoiding callingdata._set_dtype
as often, specifically when the new and the old dtypes are equal.An complicating issue was that comparisons of
CategoricalDtype
was quite slow, so the improvements I saw in #20395 were offset by slowdowns in other places. To avoid this,CategoricalDtype.__equal__
needed to become smarter and pandas has to pass round existing dtypes rather than onlycategories
andordered
. This is ok, as CategoricalDtype is immutable. This minimizes the need to call the expensiveCategoricalDtype.__hash__
method inCategoricalDtype.__equal__
and makes comparisons much faster.Some notable results
First setup:
Results:
Benchmarks
benchmarks/indexing.py:
benchmarks/categoricals.py:
I haven't run the whole test suite, as that takes a long time (4-5 hours?) for me. Would appreciate input first and, if I need to run the whole suite, if there is a smarter way to do it.