BENCH: Fix CategoricalIndexing benchmark #38476


Merged
12 changes: 4 additions & 8 deletions .github/workflows/ci.yml
@@ -74,14 +74,10 @@ jobs:
 asv check -E existing
 git remote add upstream https://github.com/pandas-dev/pandas.git
 git fetch upstream
-if git diff upstream/master --name-only | grep -q "^asv_bench/"; then
-    asv machine --yes
-    asv dev | sed "/failed$/ s/^/##[error]/" | tee benchmarks.log
-    if grep "failed" benchmarks.log > /dev/null ; then
-        exit 1
-    fi
-else
-    echo "Benchmarks did not run, no changes detected"
+asv machine --yes
+asv dev | sed "/failed$/ s/^/##[error]/" | tee benchmarks.log
+if grep "failed" benchmarks.log > /dev/null ; then
+    exit 1
 fi
 if: always()

4 changes: 3 additions & 1 deletion asv_bench/benchmarks/indexing.py
@@ -3,6 +3,7 @@
 lower-level methods directly on Index and subclasses, see index_object.py,
 indexing_engine.py, and index_cached.py
 """
+import string
 import warnings

 import numpy as np
@@ -255,6 +256,7 @@ def setup(self, index):
             "non_monotonic": CategoricalIndex(list("abc" * N)),
         }
         self.data = indices[index]
+        self.data_unique = CategoricalIndex(list(string.printable))

         self.int_scalar = 10000
         self.int_list = list(range(10000))
@@ -281,7 +283,7 @@ def time_get_loc_scalar(self, index):
         self.data.get_loc(self.cat_scalar)

     def time_get_indexer_list(self, index):
-        self.data.get_indexer(self.cat_list)
+        self.data_unique.get_indexer(self.cat_list)
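For reference, here is a condensed, runnable sketch of what the changed benchmark method exercises. The class name is illustrative and the asv parameter handling is omitted, so this is not verbatim from indexing.py:

```python
import string

from pandas import CategoricalIndex


class CategoricalIndexingSketch:
    """Illustrative stand-in for the benchmark class in indexing.py."""

    def setup(self):
        # string.printable has 100 distinct characters, so this index has no duplicates.
        self.data_unique = CategoricalIndex(list(string.printable))
        self.cat_list = ["a", "c"]

    def time_get_indexer_list(self):
        self.data_unique.get_indexer(self.cat_list)


bench = CategoricalIndexingSketch()
bench.setup()
bench.time_get_indexer_list()
```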
Member:
What is the timing you get for this? Because data_unique is only a small index, while data uses N = 10 ** 5. We might want to make data_unique larger as well?

(there is a rands helper in pandas._testing to create random strings of a certain length)
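As a rough sketch of that suggestion, the snippet below builds a larger index of unique multi-character labels. It uses itertools.product rather than the pandas._testing rands helper mentioned above, purely to keep the example self-contained (the helper's exact location may vary across pandas versions):

```python
import string
from itertools import islice, product

from pandas import CategoricalIndex

N = 10**5
# Three-character combinations of the printable characters give far more than
# N distinct labels; take the first N of them.
labels = ["".join(chars) for chars in islice(product(string.printable, repeat=3), N)]
idx = CategoricalIndex(labels)
assert idx.is_unique and len(idx) == N
```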

Member Author:
After #38372, CategoricalIndex.get_indexer raises an error if the index has duplicate values, so I think that to keep this benchmark we need to use this smaller index.
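A minimal illustration of the behaviour being described (the exact exception type and message depend on the pandas version):

```python
from pandas import CategoricalIndex

unique = CategoricalIndex(list("abc"))   # no duplicates
dup = CategoricalIndex(list("abc" * 3))  # contains duplicates

print(unique.get_indexer(["a", "c"]))    # fine: [0 2]
try:
    dup.get_indexer(["a", "c"])          # raises once duplicate values are rejected
except Exception as exc:
    print(type(exc).__name__, exc)
```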

Member:
You can still have more unique strings by combining multiple characters (which is what rands does). Just want to be sure that it's not such a tiny benchmark that it might not catch actual regressions.

Member Author:
Ah good point. Here's a short timeit of this benchmark.

In [3]: %timeit idx.get_indexer(['a', 'c'])
736 µs ± 3.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

If you think this is too tiny I can bump it up, though I think asv regressions are reported as a percentage, correct?

Member:
I think that's too little. Based on a quick profile, we are mostly measuring the overhead here (converting ['a', 'c'] to an Index, checking the dtypes, etc.), which doesn't necessarily scale linearly with the size of the data, while the actual indexing operation is only about 10% of the time. So if we had a regression there that made it 2x slower, we would hardly notice it in this benchmark's timing.

Now, we could maybe also solve that by making cat_list longer instead of making idx longer (e.g. int_list has 10000 elements, cat_list only 2). Doing cat_list = ['a', 'c'] * 5000 would probably already help.
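A quick sketch of that suggestion: keep the unique index but lengthen the lookup list, so the measured time is dominated by the indexing work rather than per-call overhead. The variable names mirror the benchmark's setup; the %timeit line assumes an IPython session.

```python
import string

from pandas import CategoricalIndex

data_unique = CategoricalIndex(list(string.printable))
cat_list = ["a", "c"] * 5000   # 10 000 lookups instead of 2

data_unique.get_indexer(cat_list)
# In IPython: %timeit data_unique.get_indexer(cat_list)
```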



class MethodLookup: