
PERF: Cache hashing of categories #40193


Merged

merged 3 commits into pandas-dev:master on Apr 12, 2021

Conversation

fjetter
Member

@fjetter fjetter commented Mar 3, 2021

Hashing of the dtype is used in numerous places in the code base. For example, it is used in concat operations to determine whether all objects have the same dtype. For CategoricalDtype this operation can be extremely costly, since it requires hashing the array/list of categories.

The following example shows the impact of this caching nicely, since pd.concat hashes each dtype at least twice:

import string
import pandas as pd


def make_df():
    return pd.DataFrame(
        {
            "A": [
                string.ascii_letters[ix : (ix + 10) % len(string.ascii_letters)]
                for ix in range(10000)
            ],
            "B": [
                string.ascii_lowercase[ix : (ix + 10) % len(string.ascii_lowercase)]
                for ix in range(10000)
            ],
        }
    ).astype({"A": "category", "B": "category"})


dfs = [make_df() for _ in range(100)]
# without caching
%time pd.concat(dfs)
CPU times: user 74.1 ms, sys: 6.32 ms, total: 80.4 ms
Wall time: 80.2 ms

# with caching
%time pd.concat(dfs)
CPU times: user 46.1 ms, sys: 6.65 ms, total: 52.7 ms
Wall time: 52.1 ms

Note: On an earlier version (I believe it was 1.1.x, but I need to check again) this caching brought the concat down to ~30 ms, but I haven't identified that regression yet.
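For readers skimming the PR: the change boils down to computing the expensive categories hash once per dtype instance and reusing it on later __hash__ calls. A minimal, self-contained sketch of that memoization pattern (hypothetical class, not the pandas implementation; the attribute name _hash_cache mirrors the diff discussed below):

class ExpensiveToHash:
    """Toy stand-in for a dtype whose hash is costly to compute."""

    def __init__(self, categories, ordered=False):
        self.categories = tuple(categories)  # effectively immutable contents
        self.ordered = ordered
        self._hash_cache = None

    def __hash__(self) -> int:
        if self._hash_cache is None:
            # stand-in for the costly hashing of all categories
            self._hash_cache = hash((self.categories, self.ordered))
        return self._hash_cache

This only pays off because the dtype is effectively immutable; if the categories could change after construction, the cached hash would go stale.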

  • closes N/A
  • N/A
  • Ensure all linting tests pass, see here for how to run them
  • whatsnew entry

@fjetter fjetter added the Performance (Memory or execution speed performance) label Mar 3, 2021
@jreback
Contributor

jreback commented Mar 3, 2021

this is hit by existing asvs?

@fjetter
Member Author

fjetter commented Mar 3, 2021

Yes, there are the categoricals.Concat benchmarks.

before

asv run --show-stderr --python=same --bench=categoricals.Concat       


· Running 6 total benchmarks (1 commits * 1 environments * 6 benchmarks)
[  0.00%] ·· Benchmarking existing-py_Users_fjetter_miniconda3_envs_pandas-dev_bin_python
[  8.33%] ··· Running (categoricals.Concat.time_append_non_overlapping_index--)......
[ 58.33%] ··· categoricals.Concat.time_append_non_overlapping_index                  6.82±0.2ms
[ 66.67%] ··· categoricals.Concat.time_append_overlapping_index                      2.73±0.1ms
[ 75.00%] ··· categoricals.Concat.time_concat                                       2.66±0.05ms
[ 83.33%] ··· categoricals.Concat.time_concat_non_overlapping_index                  7.69±0.1ms
[ 91.67%] ··· categoricals.Concat.time_concat_overlapping_index                      3.34±0.2ms
[100.00%] ··· categoricals.Concat.time_union                                         2.90±0.1ms

and after

· Running 6 total benchmarks (1 commits * 1 environments * 6 benchmarks)
[  0.00%] ·· Benchmarking existing-py_Users_fjetter_miniconda3_envs_pandas-dev_bin_python
[  8.33%] ··· Running (categoricals.Concat.time_append_non_overlapping_index--)......
[ 58.33%] ··· categoricals.Concat.time_append_non_overlapping_index                  1.35±0.3ms
[ 66.67%] ··· categoricals.Concat.time_append_overlapping_index                       78.7±30μs
[ 75.00%] ··· categoricals.Concat.time_concat                                       2.03±0.06ms
[ 83.33%] ··· categoricals.Concat.time_concat_non_overlapping_index                 2.03±0.05ms
[ 91.67%] ··· categoricals.Concat.time_concat_overlapping_index                        405±80μs
[100.00%] ··· categoricals.Concat.time_union                                         3.11±0.5ms

The improvements here (e.g. categoricals.Concat.time_append_overlapping_index) seem a bit too spectacular to be true. That's probably due to the caching, since the benchmark iterations do not invalidate the cache. Do we have any experience with this kind of thing? I'll happily change the implementation if there are recommendations for avoiding this problem.
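One way such a benchmark could measure the cold-cache path is to rebuild the frames (and therefore fresh dtype instances) inside the timed body, at the cost of also timing the construction. A hypothetical asv-style sketch, not something added in this PR:

import pandas as pd


class ConcatColdCache:
    # Hypothetical benchmark: recreating the categorical frames inside the
    # timed function means every call starts with uncached dtype hashes.
    def setup(self):
        self.values = [str(i) for i in range(10_000)]

    def time_concat_fresh_dtypes(self):
        dfs = [
            pd.DataFrame({"A": self.values}).astype({"A": "category"})
            for _ in range(10)
        ]
        pd.concat(dfs)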

@fjetter
Member Author

fjetter commented Mar 3, 2021

I believe the test failures are unrelated; I see a lot of import errors regarding html5lib.

@jreback jreback added this to the 1.3 milestone Mar 3, 2021
@jreback jreback added the Categorical (Categorical Data Type) label Mar 3, 2021
Contributor

@jreback jreback left a comment

great, can you add a whatsnew note (ref this PR). ping on green-ish

@jreback
Contributor

jreback commented Mar 3, 2021

also wouldn't mind the example you have above as an additional concat asv

@fjetter fjetter changed the title from "PERF: Cach hashing of categories" to "PERF: Cache hashing of categories" Mar 3, 2021
Member

@jorisvandenbossche jorisvandenbossche left a comment

Cool, thanks!

@jorisvandenbossche
Member

also wouldn't mind the example you have above as an additional concat asv

The existing benchmarks already cover this case nicely, so I wouldn't add more benchmarks unless needed.

@@ -360,7 +363,8 @@ def __hash__(self) -> int:
             else:
                 return -2
         # We *do* want to include the real self.ordered here
-        return int(self._hash_categories(self.categories, self.ordered))
+        self._hash_cache = int(self._hash_categories(self.categories, self.ordered))
Member

is _hash_categories ever called with anything other than (self.categories, self.ordered)? if not, could just make it a cache_readonly

Member Author

@fjetter fjetter Mar 4, 2021

This is the only place where this function is used. To be clear, you propose to do

    @cache_readonly
    def __hash__(self):
        ...

or would you add this cache to CategoricalDtype._hash_categories? _hash_categories is a static method and I would assume that breaks since different instances would access the same cache, wouldn't they?

Member

the suggestion is that instead of a staticmethod it would become

@cache_readonly
def _hash_categories(self):
    categories = self.categories
    ordered = self.ordered 
    [the rest the same as now]
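For context, cache_readonly is pandas' internal cached-property decorator: the wrapped method is evaluated on first attribute access and the result is stored per instance. A small usage sketch, assuming the pandas._libs.properties import path (the "_libs" import the author mentions trying below); the class and attribute names here are made up for illustration:

from pandas._libs.properties import cache_readonly


class Example:
    def __init__(self, values):
        self.values = tuple(values)

    @cache_readonly
    def summary(self):
        # computed on first access, then served from a per-instance cache
        return sum(hash(v) for v in self.values)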

Member Author

I tried the cache_readonly but ran into two problems

  1. Circular imports if imported via pandas.util. Could only make this work if I import it directly from the _libs
  2. I get failing tests which look like the cache was not attached to the instance but rather to the class, and/or there is some type confusion. Haven't debugged this further. The failing test is TestCategoricalDtypeParametrized.test_mixed:
    def test_mixed(self):
        a = CategoricalDtype(["a", "b", 1, 2])
        b = CategoricalDtype(["a", "b", "1", "2"])
>       assert hash(a) != hash(b)
E       AssertionError: assert 7627359294613847363 != 7627359294613847363
E        +  where 7627359294613847363 = hash(CategoricalDtype(categories=['a', 'b', 1, 2], ordered=False))
E        +  and   7627359294613847363 = hash(CategoricalDtype(categories=['a', 'b', '1', '2'], ordered=False))

Contributor

Circular imports if imported via pandas.util. Could only make this work if I import it directly from the _libs

yes do this

Contributor

  1. should not be the case, you will have to see why this occurs

Member Author

Was worth looking into this. The cause for 2.) is a name collision with the extension dtype instance caching

_cache: Dict[str_type, PandasExtensionDtype] = {}

since this and the property cache used the same attribute name _cache for their respective caches. Will update the PR shortly
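A contrived illustration of that kind of collision (hypothetical descriptor and class, not the pandas internals): if a cached-property implementation stores its results in an attribute named _cache and the class already defines a class-level _cache dict for something else, every instance ends up reading and writing the same shared dict, so cached values leak across instances, matching the test_mixed failure above:

class naive_cache_readonly:
    # Hypothetical descriptor that caches results in an attribute named "_cache".
    def __init__(self, func):
        self.func = func
        self.name = func.__name__

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        # getattr finds the *class-level* _cache dict defined on Dtype below,
        # so all instances share one cache.
        cache = getattr(obj, "_cache", None)
        if cache is None:
            cache = obj._cache = {}
        if self.name not in cache:
            cache[self.name] = self.func(obj)
        return cache[self.name]


class Dtype:
    _cache = {}  # class-level instance registry, unrelated to the property cache

    def __init__(self, categories):
        self.categories = tuple(categories)

    @naive_cache_readonly
    def _hash_categories(self):
        return hash(self.categories)


a = Dtype(["a", "b", 1, 2])
b = Dtype(["a", "b", "1", "2"])
print(a._hash_categories == b._hash_categories)  # True: b reuses a's cached hash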

Member Author

The cache_readonly cache implementation is a bit buggy with some non-trivial implications. I'll open another PR for this

Member Author

xref #40329

@jreback
Contributor

jreback commented Mar 8, 2021

@fjetter can you merge master and update

@fjetter
Member Author

fjetter commented Mar 9, 2021

Rebased and builds are green. Added a whatsnew entry, but kept the caching logic as is (see comment above).

@jreback

@jbrockmendel
Member

can you merge master and we'll see if we can push this through

Contributor

@jreback jreback left a comment

can you re-run the perf tests to assert that your patch is working (ideally also run some select asv's)

@@ -895,13 +895,6 @@ cdef class Tick(SingleConstructorOffset):

raise ApplyTypeError(f"Unhandled type: {type(other).__name__}")

# --------------------------------------------------------------------
Contributor

these are superfluous?

@fjetter fjetter force-pushed the run_ci branch 2 times, most recently from 64b5b52 to 94031ed on April 6, 2021 11:28
@jreback
Contributor

jreback commented Apr 9, 2021

can you merge master (the previous PR changes are still here)

@jreback
Contributor

jreback commented Apr 9, 2021

also if you'd re-run the asv's and post to make sure

@fjetter
Member Author

fjetter commented Apr 12, 2021

Reran the benchmarks on both master and this branch.

master

asv run --show-stderr --python=same --bench=categoricals.Concat
· Discovering benchmarks
· Running 6 total benchmarks (1 commits * 1 environments * 6 benchmarks)
[  0.00%] ·· Benchmarking existing-py_Users_fjetter_mambaforge_envs_pandas-dev-39_bin_python
[  8.33%] ··· Running (categoricals.Concat.time_append_non_overlapping_index--)......
[ 58.33%] ··· categoricals.Concat.time_append_non_overlapping_index  4.23±0.08ms
[ 66.67%] ··· categoricals.Concat.time_append_overlapping_index         1.78±0ms
[ 75.00%] ··· categoricals.Concat.time_concat                            467±8μs
[ 83.33%] ··· categoricals.Concat.time_concat_non_overlapping_index  4.39±0.01ms
[ 91.67%] ··· categoricals.Concat.time_concat_overlapping_index      2.01±0.04ms
[100.00%] ··· categoricals.Concat.time_union                         1.68±0.08ms

This branch

· Discovering benchmarks
· Running 6 total benchmarks (1 commits * 1 environments * 6 benchmarks)
[  0.00%] ·· Benchmarking existing-py_Users_fjetter_mambaforge_envs_pandas-dev-39_bin_python
[  8.33%] ··· Running (categoricals.Concat.time_append_non_overlapping_index--)......
[ 58.33%] ··· categoricals.Concat.time_append_non_overlapping_index     585±20μs
[ 66.67%] ··· categoricals.Concat.time_append_overlapping_index       21.3±0.6μs
[ 75.00%] ··· categoricals.Concat.time_concat                           371±20μs
[ 83.33%] ··· categoricals.Concat.time_concat_non_overlapping_index     886±60μs
[ 91.67%] ··· categoricals.Concat.time_concat_overlapping_index          162±1μs
[100.00%] ··· categoricals.Concat.time_union                         1.59±0.06ms

Note that the drop in e.g. time_concat_overlapping_index is only that large because we do not invalidate the cache during benchmarks. Realistic performance gains should be in the two-digit percentage range.

Using the make_df function from above with a one-time execution gives something like a 20% drop:

master

%time pd.concat(dfs)
CPU times: user 50.2 ms, sys: 2.08 ms, total: 52.3 ms
Wall time: 51.2 ms

this branch

%time pd.concat(dfs)
CPU times: user 36.2 ms, sys: 4.37 ms, total: 40.6 ms
Wall time: 39.4 ms

@jreback jreback merged commit 67e1c09 into pandas-dev:master Apr 12, 2021
@jreback
Contributor

jreback commented Apr 12, 2021

thanks @fjetter

@fjetter fjetter deleted the run_ci branch April 12, 2021 14:14
JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021
Labels
Categorical (Categorical Data Type), Performance (Memory or execution speed performance)

4 participants