
PERF: Cache hashing of categories #40193


Merged

merged 3 commits into pandas-dev:master on Apr 12, 2021

Conversation

fjetter
Member

@fjetter fjetter commented Mar 3, 2021

Hashing of the dtype is used in numerous places in the code base. For example, it is used in concat operations to determine whether all objects have the same dtype. For CategoricalDtype this operation can be extremely costly, since it requires hashing the array/list of categories.

The following example shows the impact of this caching nicely, since pd.concat hashes each dtype at least twice:

import string
import pandas as pd


def make_df():
    return pd.DataFrame(
        {
            "A": [
                string.ascii_letters[ix : (ix + 10) % len(string.ascii_letters)]
                for ix in range(10000)
            ],
            "B": [
                string.ascii_lowercase[ix : (ix + 10) % len(string.ascii_lowercase)]
                for ix in range(10000)
            ],
        }
    ).astype({"A": "category", "B": "category"})


dfs = [make_df() for _ in range(100)]
# without caching
%time pd.concat(dfs)
CPU times: user 74.1 ms, sys: 6.32 ms, total: 80.4 ms
Wall time: 80.2 ms

# with caching
%time pd.concat(dfs)
CPU times: user 46.1 ms, sys: 6.65 ms, total: 52.7 ms
Wall time: 52.1 ms

Note: On an earlier version (I believe it was 1.1.x, but I need to check again) this caching brought the concat down to ~30 ms, but I haven't identified that regression yet.
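For readers skimming the PR: the change boils down to computing the expensive categories hash once per dtype instance and reusing it on later __hash__ calls. A minimal, self-contained sketch of that memoization pattern (hypothetical class, not the pandas implementation; the attribute name _hash_cache mirrors the diff discussed below):

class ExpensiveToHash:
    """Toy stand-in for a dtype whose hash is costly to compute."""

    def __init__(self, categories, ordered=False):
        self.categories = tuple(categories)  # effectively immutable contents
        self.ordered = ordered
        self._hash_cache = None

    def __hash__(self) -> int:
        if self._hash_cache is None:
            # stand-in for the costly hashing of all categories
            self._hash_cache = hash((self.categories, self.ordered))
        return self._hash_cache

This only pays off because the dtype is effectively immutable; if the categories could change after construction, the cached hash would go stale.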

  • closes N/A
  • N/A
  • Ensure all linting tests pass, see here for how to run them
  • whatsnew entry

@fjetter fjetter added the Performance (Memory or execution speed performance) label Mar 3, 2021
@jreback
Contributor

jreback commented Mar 3, 2021

this is hit by existing asvs?

@fjetter
Member Author

fjetter commented Mar 3, 2021

Yes, there are the categoricals.Concat benchmarks.

before

asv run --show-stderr --python=same --bench=categoricals.Concat       


· Running 6 total benchmarks (1 commits * 1 environments * 6 benchmarks)
[  0.00%] ·· Benchmarking existing-py_Users_fjetter_miniconda3_envs_pandas-dev_bin_python
[  8.33%] ··· Running (categoricals.Concat.time_append_non_overlapping_index--)......
[ 58.33%] ··· categoricals.Concat.time_append_non_overlapping_index                  6.82±0.2ms
[ 66.67%] ··· categoricals.Concat.time_append_overlapping_index                      2.73±0.1ms
[ 75.00%] ··· categoricals.Concat.time_concat                                       2.66±0.05ms
[ 83.33%] ··· categoricals.Concat.time_concat_non_overlapping_index                  7.69±0.1ms
[ 91.67%] ··· categoricals.Concat.time_concat_overlapping_index                      3.34±0.2ms
[100.00%] ··· categoricals.Concat.time_union                                         2.90±0.1ms

and after

· Running 6 total benchmarks (1 commits * 1 environments * 6 benchmarks)
[  0.00%] ·· Benchmarking existing-py_Users_fjetter_miniconda3_envs_pandas-dev_bin_python
[  8.33%] ··· Running (categoricals.Concat.time_append_non_overlapping_index--)......
[ 58.33%] ··· categoricals.Concat.time_append_non_overlapping_index                  1.35±0.3ms
[ 66.67%] ··· categoricals.Concat.time_append_overlapping_index                       78.7±30μs
[ 75.00%] ··· categoricals.Concat.time_concat                                       2.03±0.06ms
[ 83.33%] ··· categoricals.Concat.time_concat_non_overlapping_index                 2.03±0.05ms
[ 91.67%] ··· categoricals.Concat.time_concat_overlapping_index                        405±80μs
[100.00%] ··· categoricals.Concat.time_union                                         3.11±0.5ms

The improvements here (e.g. categoricals.Concat.time_append_overlapping_index) seem a bit too spectacular to be true. That's probably due to the caching, since the benchmark iterations do not invalidate the cache. Do we have any experience with this kind of thing? I'll happily change the implementation if there are recommendations for avoiding this problem.
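One way such a benchmark could measure the cold-cache path is to rebuild the frames (and therefore fresh dtype instances) inside the timed body, at the cost of also timing the construction. A hypothetical asv-style sketch, not something added in this PR:

import pandas as pd


class ConcatColdCache:
    # Hypothetical benchmark: recreating the categorical frames inside the
    # timed function means every call starts with uncached dtype hashes.
    def setup(self):
        self.values = [str(i) for i in range(10_000)]

    def time_concat_fresh_dtypes(self):
        dfs = [
            pd.DataFrame({"A": self.values}).astype({"A": "category"})
            for _ in range(10)
        ]
        pd.concat(dfs)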

@fjetter
Member Author

fjetter commented Mar 3, 2021

I believe the test failures are unrelated; I see a lot of import errors regarding html5lib.

@jreback jreback added this to the 1.3 milestone Mar 3, 2021
@jreback jreback added the Categorical (Categorical Data Type) label Mar 3, 2021
Contributor

@jreback jreback left a comment

great, can you add a whatsnew note (ref this PR). ping on green-ish

@jreback
Contributor

jreback commented Mar 3, 2021

also wouldn't mind the example you have above as an additional concat asv

@fjetter fjetter changed the title from "PERF: Cach hashing of categories" to "PERF: Cache hashing of categories" Mar 3, 2021
Member

@jorisvandenbossche jorisvandenbossche left a comment

Cool, thanks!

@jorisvandenbossche
Member

also wouldn't mind the example you have above as an additional concat asv

The existing benchmarks already cover this case nicely, so I wouldn't add more benchmarks unless needed.

@@ -360,7 +363,8 @@ def __hash__(self) -> int:
             else:
                 return -2
         # We *do* want to include the real self.ordered here
-        return int(self._hash_categories(self.categories, self.ordered))
+        self._hash_cache = int(self._hash_categories(self.categories, self.ordered))
Member

is _hash_categories ever called with anything other than (self.categories, self.ordered)? if not, could just make it a cache_readonly

Member Author

@fjetter fjetter Mar 4, 2021

This is the only place where this function is used. To be clear, you propose to do

    @cache_readonly
    def __hash__(self):
        ...

or would you add this cache to CategoricalDtype._hash_categories? _hash_categories is a static method and I would assume that breaks since different instances would access the same cache, wouldn't they?

Member

the suggestion is that instead of a staticmethod it would become

@cache_readonly
def _hash_categories(self):
    categories = self.categories
    ordered = self.ordered 
    [the rest the same as now]
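For context, cache_readonly is pandas' internal cached-property decorator: the wrapped method is evaluated on first attribute access and the result is stored per instance. A small usage sketch, assuming the pandas._libs.properties import path (the "_libs" import the author mentions trying below); the class and attribute names here are made up for illustration:

from pandas._libs.properties import cache_readonly


class Example:
    def __init__(self, values):
        self.values = tuple(values)

    @cache_readonly
    def summary(self):
        # computed on first access, then served from a per-instance cache
        return sum(hash(v) for v in self.values)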

Member Author

I tried the cache_readonly but ran into two problems

  1. Circular imports if imported via pandas.util. Could only make this work if I import it directly from the _libs
  2. I get failing tests which look like the cache was not attached to the instance but rather to the class, and/or there is some type confusion. Haven't debugged this further. The failing test is TestCategoricalDtypeParametrized.test_mixed:
    def test_mixed(self):
        a = CategoricalDtype(["a", "b", 1, 2])
        b = CategoricalDtype(["a", "b", "1", "2"])
>       assert hash(a) != hash(b)
E       AssertionError: assert 7627359294613847363 != 7627359294613847363
E        +  where 7627359294613847363 = hash(CategoricalDtype(categories=['a', 'b', 1, 2], ordered=False))
E        +  and   7627359294613847363 = hash(CategoricalDtype(categories=['a', 'b', '1', '2'], ordered=False))

Contributor

Circular imports if imported via pandas.util. Could only make this work if I import it directly from the _libs

yes do this

Contributor

  1. should not be the case, you will have to see why this occurs

Member Author

Was worth looking into this. The cause for 2.) is a name collision with the extension dtype instance caching

_cache: Dict[str_type, PandasExtensionDtype] = {}

since this and the property cache used the same attribute name _cache for their respective caches. Will update the PR shortly
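A contrived illustration of that kind of collision (hypothetical descriptor and class, not the pandas internals): if a cached-property implementation stores its results in an attribute named _cache and the class already defines a class-level _cache dict for something else, every instance ends up reading and writing the same shared dict, so cached values leak across instances, matching the test_mixed failure above:

class naive_cache_readonly:
    # Hypothetical descriptor that caches results in an attribute named "_cache".
    def __init__(self, func):
        self.func = func
        self.name = func.__name__

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        # getattr finds the *class-level* _cache dict defined on Dtype below,
        # so all instances share one cache.
        cache = getattr(obj, "_cache", None)
        if cache is None:
            cache = obj._cache = {}
        if self.name not in cache:
            cache[self.name] = self.func(obj)
        return cache[self.name]


class Dtype:
    _cache = {}  # class-level instance registry, unrelated to the property cache

    def __init__(self, categories):
        self.categories = tuple(categories)

    @naive_cache_readonly
    def _hash_categories(self):
        return hash(self.categories)


a = Dtype(["a", "b", 1, 2])
b = Dtype(["a", "b", "1", "2"])
print(a._hash_categories == b._hash_categories)  # True: b reuses a's cached hash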

Member Author

The cache_readonly cache implementation is a bit buggy with some non-trivial implications. I'll open another PR for this

Member Author

xref #40329

@jreback
Contributor

jreback commented Mar 8, 2021

@fjetter can you merge master and update

@fjetter
Member Author

fjetter commented Mar 9, 2021

Rebased and builds are green. Added a whatsnew entry, but kept the caching logic as is (see comment above).

@jreback

@jbrockmendel
Member

can you merge master and we'll see if we can push this through

Contributor

@jreback jreback left a comment

can you re-run the perf tests to assert that your patch is working (ideally also run some select asv's)

@@ -895,13 +895,6 @@ cdef class Tick(SingleConstructorOffset):

raise ApplyTypeError(f"Unhandled type: {type(other).__name__}")

# --------------------------------------------------------------------
Contributor

these are superfluous?

@fjetter fjetter force-pushed the run_ci branch 2 times, most recently from 64b5b52 to 94031ed on April 6, 2021 11:28
@jreback
Contributor

jreback commented Apr 9, 2021

can you merge master (the previous PR changes are still here)

@jreback
Contributor

jreback commented Apr 9, 2021

also if you'd re-run the asv's and post to make sure

@fjetter
Member Author

fjetter commented Apr 12, 2021

Reran the benchmarks on both master and this branch.

master

asv run --show-stderr --python=same --bench=categoricals.Concat
· Discovering benchmarks
· Running 6 total benchmarks (1 commits * 1 environments * 6 benchmarks)
[  0.00%] ·· Benchmarking existing-py_Users_fjetter_mambaforge_envs_pandas-dev-39_bin_python
[  8.33%] ··· Running (categoricals.Concat.time_append_non_overlapping_index--)......
[ 58.33%] ··· categoricals.Concat.time_append_non_overlapping_index  4.23±0.08ms
[ 66.67%] ··· categoricals.Concat.time_append_overlapping_index         1.78±0ms
[ 75.00%] ··· categoricals.Concat.time_concat                            467±8μs
[ 83.33%] ··· categoricals.Concat.time_concat_non_overlapping_index  4.39±0.01ms
[ 91.67%] ··· categoricals.Concat.time_concat_overlapping_index      2.01±0.04ms
[100.00%] ··· categoricals.Concat.time_union                         1.68±0.08ms

This branch

· Discovering benchmarks
· Running 6 total benchmarks (1 commits * 1 environments * 6 benchmarks)
[  0.00%] ·· Benchmarking existing-py_Users_fjetter_mambaforge_envs_pandas-dev-39_bin_python
[  8.33%] ··· Running (categoricals.Concat.time_append_non_overlapping_index--)......
[ 58.33%] ··· categoricals.Concat.time_append_non_overlapping_index     585±20μs
[ 66.67%] ··· categoricals.Concat.time_append_overlapping_index       21.3±0.6μs
[ 75.00%] ··· categoricals.Concat.time_concat                           371±20μs
[ 83.33%] ··· categoricals.Concat.time_concat_non_overlapping_index     886±60μs
[ 91.67%] ··· categoricals.Concat.time_concat_overlapping_index          162±1μs
[100.00%] ··· categoricals.Concat.time_union                         1.59±0.06ms

Note that the drop in e.g. time_concat_overlapping_index is only that large because we do not invalidate the cache during benchmarks. Realistic performance gains should be in the two-digit percentage range.

Using the make_df function from above with a one-time execution gives something like a 20% drop:

master

%time pd.concat(dfs)
CPU times: user 50.2 ms, sys: 2.08 ms, total: 52.3 ms
Wall time: 51.2 ms

this branch

%time pd.concat(dfs)
CPU times: user 36.2 ms, sys: 4.37 ms, total: 40.6 ms
Wall time: 39.4 ms

@jreback jreback merged commit 67e1c09 into pandas-dev:master Apr 12, 2021
@jreback
Contributor

jreback commented Apr 12, 2021

thanks @fjetter

@fjetter fjetter deleted the run_ci branch April 12, 2021 14:14
JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021
Labels
Categorical (Categorical Data Type), Performance (Memory or execution speed performance)

4 participants