BENCH: Fix CategoricalIndexing benchmark #38476
Conversation
Thanks for fixing this!
@@ -281,7 +283,7 @@ def time_get_loc_scalar(self, index):
         self.data.get_loc(self.cat_scalar)

     def time_get_indexer_list(self, index):
-        self.data.get_indexer(self.cat_list)
+        self.data_unique.get_indexer(self.cat_list)
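For context, a minimal sketch of what the benchmark class around this hunk might look like (the class name, the simplified signatures, and the exact setup values are assumptions for illustration, not copied from the pandas asv sources, which parameterize over several index kinds):

import numpy as np
import pandas as pd

class CategoricalIndexIndexing:  # hypothetical name for this sketch
    def setup(self):
        N = 10 ** 5
        # "data" repeats each category N times, so it contains duplicates;
        # "data_unique" holds each category exactly once.
        self.data = pd.CategoricalIndex(np.repeat(list("abc"), N))
        self.data_unique = pd.CategoricalIndex(list("abc"))
        self.cat_scalar = "b"
        self.cat_list = ["a", "c"]

    def time_get_loc_scalar(self, index=None):
        self.data.get_loc(self.cat_scalar)

    def time_get_indexer_list(self, index=None):
        self.data_unique.get_indexer(self.cat_list)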
What is the timing you get for this? Because data_unique is only a small index, while data uses N = 10 ** 5. We might want to make data_unique larger as well? (There is a rands helper in pandas._testing to create random strings of a certain length.)
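For concreteness, a hedged sketch of that suggestion, assuming the pandas._testing.rands helper of this era returns one random string of the given length:

import pandas as pd
import pandas._testing as tm

N = 10 ** 5
# Build N random 10-character strings; the set comprehension drops any
# collisions, so the resulting index stays unique, as get_indexer requires.
unique_strings = list({tm.rands(10) for _ in range(N)})
data_unique = pd.CategoricalIndex(unique_strings)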
After #38372, CategoricalIndex.get_indexer raises an error if the index has duplicate values, so I think that to keep this benchmark we need to use this smaller index.
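A minimal illustration of that behavior change (caught here with a generic except, since the exact exception class is an implementation detail):

import pandas as pd

dup = pd.CategoricalIndex(["a", "a", "b"])  # contains duplicate values
try:
    dup.get_indexer(["a", "b"])
except Exception as err:
    # pandas raises here because reindexing requires a unique index
    print(type(err).__name__, err)

uniq = pd.CategoricalIndex(["a", "b", "c"])
print(uniq.get_indexer(["a", "c"]))  # -> [0 2]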
You can still have more unique strings by combining multiple characters (which is what rands does). Just want to be sure that it's not a tiny benchmark that might not catch actual regressions.
Ah good point. Here's a short timeit of this benchmark.
In [3]: %timeit idx.get_indexer(['a', 'c'])
736 µs ± 3.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
If you think this is too tiny I can bump it up, though I think asv regressions are reported as a percentage, correct?
I think that's too little. Based on a quick profile, we are mostly measuring the overhead here (converting ['a', 'c'] to an Index, checking the dtypes, etc.), which doesn't necessarily scale linearly with the size of the data, while the actual indexing operation is only about 10% of the time. So if we had a regression there that made the indexing 2x slower, we would hardly notice it in this benchmark's timing.
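One way to reproduce that kind of profile (a sketch, not the reviewer's actual commands):

import cProfile
import pandas as pd

idx = pd.CategoricalIndex(list("abc"))
# Sorting by cumulative time shows most of it landing in argument
# conversion and dtype checks rather than the underlying lookup.
cProfile.run("for _ in range(1000): idx.get_indexer(['a', 'c'])", sort="cumtime")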
Now, we could maybe also solve that by making cat_list longer instead of making idx longer (e.g. int_list has 10000 elements, cat_list only 2). Doing cat_list = ['a', 'c'] * 5000 would probably already help.
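A quick way to check that suggestion is to time both list sizes side by side:

import timeit
import pandas as pd

idx = pd.CategoricalIndex(list("abc"))
short_list = ["a", "c"]
long_list = ["a", "c"] * 5000  # the reviewer's suggested size

# With the longer list, the per-call fixed overhead (converting the list
# to an Index, dtype checks) is amortized over far more lookups.
print(timeit.timeit(lambda: idx.get_indexer(short_list), number=1000))
print(timeit.timeit(lambda: idx.get_indexer(long_list), number=1000))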
thanks @mroeschke
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
This benchmark's behavior changed after #38372.
Additionally, I'm exploring whether it's feasible to always run the benchmarks to catch these changes earlier.