BENCH: Fix CategoricalIndexing benchmark #38476
Conversation
Thanks for fixing this!
@@ -281,7 +283,7 @@ def time_get_loc_scalar(self, index):
         self.data.get_loc(self.cat_scalar)

     def time_get_indexer_list(self, index):
-        self.data.get_indexer(self.cat_list)
+        self.data_unique.get_indexer(self.cat_list)
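For context, a minimal sketch of what the benchmark class around this hunk might look like (the class name, the simplified signatures, and the exact setup values are assumptions for illustration, not copied from the pandas asv sources, which parameterize over several index kinds):

import numpy as np
import pandas as pd

class CategoricalIndexIndexing:  # hypothetical name for this sketch
    def setup(self):
        N = 10 ** 5
        # "data" repeats each category N times, so it contains duplicates;
        # "data_unique" holds each category exactly once.
        self.data = pd.CategoricalIndex(np.repeat(list("abc"), N))
        self.data_unique = pd.CategoricalIndex(list("abc"))
        self.cat_scalar = "b"
        self.cat_list = ["a", "c"]

    def time_get_loc_scalar(self, index=None):
        self.data.get_loc(self.cat_scalar)

    def time_get_indexer_list(self, index=None):
        self.data_unique.get_indexer(self.cat_list)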
What is the timing you get for this? Because data_unique is only a small index, while data uses N = 10 ** 5. We might want to make data_unique larger as well? (There is a rands helper in pandas._testing to create random strings of a certain length.)
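For concreteness, a hedged sketch of that suggestion, assuming the pandas._testing.rands helper of this era returns one random string of the given length:

import pandas as pd
import pandas._testing as tm

N = 10 ** 5
# Build N random 10-character strings; the set comprehension drops any
# collisions, so the resulting index stays unique, as get_indexer requires.
unique_strings = list({tm.rands(10) for _ in range(N)})
data_unique = pd.CategoricalIndex(unique_strings)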
After #38372, CategoricalIndex.get_indexer raises an error if the index has duplicate values, so I think that to keep this benchmark we need to use this smaller index.
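A minimal illustration of that behavior change (caught here with a generic except, since the exact exception class is an implementation detail):

import pandas as pd

dup = pd.CategoricalIndex(["a", "a", "b"])  # contains duplicate values
try:
    dup.get_indexer(["a", "b"])
except Exception as err:
    # pandas raises here because reindexing requires a unique index
    print(type(err).__name__, err)

uniq = pd.CategoricalIndex(["a", "b", "c"])
print(uniq.get_indexer(["a", "c"]))  # -> [0 2]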
You can still have more unique strings by combining multiple characters (which is what rands does). Just want to be sure that it's not a tiny benchmark that might not catch actual regressions.
Ah good point. Here's a short timeit of this benchmark.
In [3]: %timeit idx.get_indexer(['a', 'c'])
736 µs ± 3.34 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
If you think this is too tiny I can bump it up, though I think asv regressions are reported as a percentage, correct?
I think that's too little. Based on a quick profile, we are mostly measuring the overhead here (converting ['a', 'c'] to an Index, checking the dtypes, etc.), which doesn't necessarily scale linearly with the size of the data, while the actual indexing operation is only about 10% of the time. So if we had a regression there that made the indexing 2x slower, we would hardly notice it in this benchmark's timing.
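One way to reproduce that kind of profile (a sketch, not the reviewer's actual commands):

import cProfile
import pandas as pd

idx = pd.CategoricalIndex(list("abc"))
# Sorting by cumulative time shows most of it landing in argument
# conversion and dtype checks rather than the underlying lookup.
cProfile.run("for _ in range(1000): idx.get_indexer(['a', 'c'])", sort="cumtime")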
Now, we could maybe also solve that by making cat_list longer instead of making idx longer (e.g. int_list has 10000 elements, cat_list only 2). Doing cat_list = ['a', 'c'] * 5000 would probably already help.
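A quick way to check that suggestion is to time both list sizes side by side:

import timeit
import pandas as pd

idx = pd.CategoricalIndex(list("abc"))
short_list = ["a", "c"]
long_list = ["a", "c"] * 5000  # the reviewer's suggested size

# With the longer list, the per-call fixed overhead (converting the list
# to an Index, dtype checks) is amortized over far more lookups.
print(timeit.timeit(lambda: idx.get_indexer(short_list), number=1000))
print(timeit.timeit(lambda: idx.get_indexer(long_list), number=1000))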
thanks @mroeschke
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
This benchmark's behavior changed after #38372.
Additionally, I'm exploring whether it's feasible to always run the benchmarks to catch these changes earlier.