-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
PERF: Cache hashing of categories #40193
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
this is hit by existing asvs? |
Yes, there are before
and after
The improvements here (e.g. |
I believe the test failures are unrelated. I receive a lot of import errors regarding |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great, can you add a whatsnew note (ref this PR). ping on green-ish
also wouldn't mind the example you have above as an additional concat asv |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, thanks!
The existing benchmarks already clearly cover this case nicely, so I wouldn't add more benchmarks if not needed. |
pandas/core/dtypes/dtypes.py
Outdated
@@ -360,7 +363,8 @@ def __hash__(self) -> int: | |||
else: | |||
return -2 | |||
# We *do* want to include the real self.ordered here | |||
return int(self._hash_categories(self.categories, self.ordered)) | |||
self._hash_cache = int(self._hash_categories(self.categories, self.ordered)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is _hash_categories ever called with anything other than (self.categories, self.ordered)
? if not, could just make it a cache_readonly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the only place where this function is used. To be clear, you propose to do
@cache_readonly
def __hash__(self):
...
or would you add this cache to CategoricalDtype._hash_categories
? _hash_categories
is a static method and I would assume that breaks since different instances would access the same cache, wouldn't they?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the suggestion is that instead of a staticmethod it would become
@cache_readonly
def _hash_categories(self):
categories = self.categories
ordered = self.ordered
[the rest the same as now]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried the cache_readonly
but ran into two problems
- Circular imports if imported via
panda.utils
. Could only make this work if I import it directly from the_libs
- I do get failed tests which look like the cache was not attached to an instance but rather the class and/or there is some type confusion. Haven't debugged this further. The failing test is
TestCategoricalDtypeParametrized.test_mixed
def test_mixed(self):
a = CategoricalDtype(["a", "b", 1, 2])
b = CategoricalDtype(["a", "b", "1", "2"])
> assert hash(a) != hash(b)
E AssertionError: assert 7627359294613847363 != 7627359294613847363
E + where 7627359294613847363 = hash(CategoricalDtype(categories=['a', 'b', 1, 2], ordered=False))
E + and 7627359294613847363 = hash(CategoricalDtype(categories=['a', 'b', '1', '2'], ordered=False))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Circular imports if imported via panda.utils. Could only make this work if I import it directly from the _libs
yes do this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- should not be the case, you will have to see why this occurs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was worth looking into this. The cause for 2.) is a name collision with the extension dtype instance caching
pandas/pandas/core/dtypes/dtypes.py
Line 89 in ce082fb
_cache: Dict[str_type, PandasExtensionDtype] = {} |
since this and the property cache used the same attribute name _cache
for their respective caches. Will update the PR shortly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The cache_readonly cache implementation is a bit buggy with some non-trivial implications. I'll open another PR for this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
xref #40329
@fjetter can you merge master and update |
rebased and builds are green. Added a whatsnew entry but kept the logic for the caching as is (see comment above) |
pandas/core/dtypes/dtypes.py
Outdated
@@ -360,7 +363,8 @@ def __hash__(self) -> int: | |||
else: | |||
return -2 | |||
# We *do* want to include the real self.ordered here | |||
return int(self._hash_categories(self.categories, self.ordered)) | |||
self._hash_cache = int(self._hash_categories(self.categories, self.ordered)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Circular imports if imported via panda.utils. Could only make this work if I import it directly from the _libs
yes do this
pandas/core/dtypes/dtypes.py
Outdated
@@ -360,7 +363,8 @@ def __hash__(self) -> int: | |||
else: | |||
return -2 | |||
# We *do* want to include the real self.ordered here | |||
return int(self._hash_categories(self.categories, self.ordered)) | |||
self._hash_cache = int(self._hash_categories(self.categories, self.ordered)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- should not be the case, you will have to see why this occurs
can you merge master and we'll see if we can push this through |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you re-run the perf tests to assert that your patch is working (ideally also run some select asv's)
pandas/_libs/tslibs/offsets.pyx
Outdated
@@ -895,13 +895,6 @@ cdef class Tick(SingleConstructorOffset): | |||
|
|||
raise ApplyTypeError(f"Unhandled type: {type(other).__name__}") | |||
|
|||
# -------------------------------------------------------------------- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these are superfluous?
64b5b52
to
94031ed
Compare
can you merge master (the previous PR changes are stil here) |
also if you'd re-run the asv's and post to make sure |
Reran both master and this branch masterasv run --show-stderr --python=same --bench=categoricals.Concat
· Discovering benchmarks
· Running 6 total benchmarks (1 commits * 1 environments * 6 benchmarks)
[ 0.00%] ·· Benchmarking existing-py_Users_fjetter_mambaforge_envs_pandas-dev-39_bin_python
[ 8.33%] ··· Running (categoricals.Concat.time_append_non_overlapping_index--)......
[ 58.33%] ··· categoricals.Concat.time_append_non_overlapping_index 4.23±0.08ms
[ 66.67%] ··· categoricals.Concat.time_append_overlapping_index 1.78±0ms
[ 75.00%] ··· categoricals.Concat.time_concat 467±8μs
[ 83.33%] ··· categoricals.Concat.time_concat_non_overlapping_index 4.39±0.01ms
[ 91.67%] ··· categoricals.Concat.time_concat_overlapping_index 2.01±0.04ms
[100.00%] ··· categoricals.Concat.time_union 1.68±0.08ms This branch· Discovering benchmarks
· Running 6 total benchmarks (1 commits * 1 environments * 6 benchmarks)
[ 0.00%] ·· Benchmarking existing-py_Users_fjetter_mambaforge_envs_pandas-dev-39_bin_python
[ 8.33%] ··· Running (categoricals.Concat.time_append_non_overlapping_index--)......
[ 58.33%] ··· categoricals.Concat.time_append_non_overlapping_index 585±20μs
[ 66.67%] ··· categoricals.Concat.time_append_overlapping_index 21.3±0.6μs
[ 75.00%] ··· categoricals.Concat.time_concat 371±20μs
[ 83.33%] ··· categoricals.Concat.time_concat_non_overlapping_index 886±60μs
[ 91.67%] ··· categoricals.Concat.time_concat_overlapping_index 162±1μs
[100.00%] ··· categoricals.Concat.time_union 1.59±0.06ms Note the drop in e.g. Using the above master%time pd.concat(dfs)
CPU times: user 50.2 ms, sys: 2.08 ms, total: 52.3 ms
Wall time: 51.2 ms this branch%time pd.concat(dfs)
CPU times: user 36.2 ms, sys: 4.37 ms, total: 40.6 ms
Wall time: 39.4 ms |
thanks @fjetter |
Hashing of the dtype is used in numerous places of the code base. For example it is used in concat operations to determine whether all objects have the same dtype. For
CategoricalDtype
this operation can be extremely costly since it requires hashing the array/list of categories.The following example shows the impact of this caching operation nicely since
pd.concat
hashes all dtypes at least twiceNote: On an earlier version (I believe it was 1.1.X but I need to check again) this caching brought this concat down to ~30ms but I haven't identified this regression, yet