
BENCH: Fix CategoricalIndexing benchmark #38476


Merged

Conversation

mroeschke (Member):

  • closes #xxxx
  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

This benchmark's behavior changed after #38372.

Additionally, I'm exploring whether it's feasible to always run the benchmarks so changes like this are caught earlier.

mroeschke added the Benchmark Performance (ASV) benchmarks label on Dec 15, 2020
mroeschke added this to the 1.3 milestone on Dec 15, 2020
jorisvandenbossche (Member) left a comment:


Thanks for fixing this!

```diff
@@ -281,7 +283,7 @@ def time_get_loc_scalar(self, index):
         self.data.get_loc(self.cat_scalar)

     def time_get_indexer_list(self, index):
-        self.data.get_indexer(self.cat_list)
+        self.data_unique.get_indexer(self.cat_list)
```
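
For context, the review thread below is about the benchmark class this hunk lives in. Here is a minimal sketch of that kind of ASV class, using the names from the diff (data, data_unique, cat_list, cat_scalar); the sizes, category values, and parameter setup are illustrative assumptions, not the exact pandas benchmark code.

```python
# Minimal sketch of an ASV-style benchmark class like the one touched above.
# Names (data, data_unique, cat_list, cat_scalar) come from the diff; the
# sizes and category values are assumptions for illustration.
import string

import pandas as pd


class CategoricalIndexIndexing:
    params = ["non_monotonic"]
    param_names = ["index"]

    def setup(self, index):
        N = 10 ** 5
        # Large index with heavily duplicated categories.
        self.data = pd.CategoricalIndex(list("abc" * N))
        # Small index with unique values; still usable with get_indexer
        # after the change discussed in pandas-dev#38372.
        self.data_unique = pd.CategoricalIndex(list(string.printable))
        self.cat_scalar = "b"
        self.cat_list = ["a", "c"]

    def time_get_loc_scalar(self, index):
        self.data.get_loc(self.cat_scalar)

    def time_get_indexer_list(self, index):
        self.data_unique.get_indexer(self.cat_list)
```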

jorisvandenbossche (Member):

What is the timing you get for this? data_unique is only a small index, while data uses N = 10 ** 5, so we might want to make data_unique larger as well?

(there is a rands helper in pandas._testing to create random strings of a certain length)
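
As a hedged sketch of that suggestion (assuming rands(nchars) in pandas._testing returns one random string of the given length, and picking an arbitrary size of 10 ** 5), a larger unique index could be built like this:

```python
import pandas as pd
import pandas._testing as tm

# Build many random 10-character strings; the set() dedupes any collisions,
# so the resulting CategoricalIndex is guaranteed to be unique.
N = 10 ** 5  # size chosen for illustration only
unique_strings = list({tm.rands(10) for _ in range(N)})
data_unique = pd.CategoricalIndex(unique_strings)
assert data_unique.is_unique
```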

mroeschke (Member Author):

After #38372, CategoricalIndex.get_indexer raises an error if the index has duplicate values, so I think to keep this benchmark we need to use this smaller index.
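
For illustration, a hedged example of the behavior being described (the exact exception type and message can vary by pandas version):

```python
import pandas as pd
from pandas.errors import InvalidIndexError

dup = pd.CategoricalIndex(list("aabbcc"))   # non-unique index
try:
    dup.get_indexer(["a", "c"])
except InvalidIndexError as err:
    # e.g. "Reindexing only valid with uniquely valued Index objects"
    print(err)

uniq = pd.CategoricalIndex(list("abc"))     # unique index still works
print(uniq.get_indexer(["a", "c"]))         # [0 2]
```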

jorisvandenbossche (Member):

You can still get more unique strings by combining multiple characters (which is what rands does). I just want to be sure it's not such a tiny benchmark that it might not catch actual regressions.

mroeschke (Member Author):

Ah, good point. Here's a short timeit of this benchmark:

In [3]: %timeit idx.get_indexer(['a', 'c'])
736 µs ± 3.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

If you think this is too tiny I can bump it up, though I think ASV regressions are reported as a percentage, correct?

jorisvandenbossche (Member):

I think that's too little. Based on a quick profile, we are mostly measuring the overhead here (converting ['a', 'c'] to an Index, checking the dtypes, etc.), which doesn't necessarily scale linearly with the size of the data, while the actual indexing operation is only about 10% of the time. So if there were a regression there that made it 2x slower, we would hardly notice it in this benchmark's timing.

Now, we could also address that by making cat_list longer instead of making the index longer (e.g. int_list has 10000 elements, cat_list only 2). Doing cat_list = ['a', 'c'] * 5000 would probably already help.
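
A hedged sketch of that suggestion (the 5000 factor comes from the comment above; the index contents are illustrative):

```python
import string

import pandas as pd

data_unique = pd.CategoricalIndex(list(string.printable))
cat_list = ["a", "c"] * 5000  # 10_000 lookups per call, comparable to int_list

# What the benchmark body would time: with a longer cat_list the lookup work
# dominates the fixed per-call overhead of converting the list to an Index.
indexer = data_unique.get_indexer(cat_list)
assert len(indexer) == len(cat_list)
```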

jreback merged commit 1138f18 into pandas-dev:master on Dec 16, 2020
jreback (Contributor) commented on Dec 16, 2020:

thanks @mroeschke

mroeschke deleted the fix/categorical_indexing_benchmark branch on December 16, 2020 01:33
luckyvs1 pushed a commit to luckyvs1/pandas that referenced this pull request Jan 20, 2021