PERF: uses bincount instead of hash table in categorical value counts #10874

behzadnouri · 2015-08-21T00:40:06Z

In [1]: np.random.seed(2718281)

In [2]: n = 500000

In [3]: u = int(0.1*n)

In [4]: arr = ["s%04d" % i for i in np.random.randint(0, u, size=n)]

In [5]: ts = pd.Series(arr).astype('category')

In [6]: %timeit ts.value_counts()
10 loops, best of 3: 82.7 ms per loop

on branch:

In [6]: %timeit ts.value_counts()
10 loops, best of 3: 31.3 ms per loop
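
For context on the title: a categorical stores each value as an integer code into its categories (with -1 for missing), so the counts can be computed with `np.bincount` on the codes instead of hashing the values. A minimal sketch of that idea, assuming a recent pandas and NumPy; the function name `bincount_value_counts` is hypothetical and this is not the code in the branch:

```python
import numpy as np
import pandas as pd


def bincount_value_counts(cat_series):
    """Count occurrences per category using the integer codes (hypothetical helper)."""
    codes = cat_series.cat.codes.to_numpy()
    categories = cat_series.cat.categories
    # Codes are positions into `categories`; -1 marks a missing value,
    # and np.bincount rejects negative input, so drop those first.
    valid = codes[codes >= 0]
    # minlength keeps a zero count for categories that never occur.
    counts = np.bincount(valid, minlength=len(categories))
    return pd.Series(counts, index=categories)


s = pd.Series(["a", "b", "a", None, "c", "a"]).astype("category")
result = bincount_value_counts(s)
```

Unlike `value_counts()`, this sketch returns the counts in category order rather than sorted by frequency.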

jreback · 2015-08-21T00:47:17Z

I think the solution in the issue is faster than this, no?

behzadnouri · 2015-08-21T01:02:53Z

I get

In [9]: %timeit Series(np.arange(len(ts.cat.categories)),ts.cat.categories).map(ts.cat.codes.value_counts()).order(ascending=False)
10 loops, best of 3: 28.3 ms per loop

but this does not check for nulls, and the index is not categorical.

behzadnouri · 2015-08-21T01:34:11Z

with dropna=False the branch performs better:

In [9]: %timeit ts.value_counts(dropna=True)
10 loops, best of 3: 32.3 ms per loop

In [10]: %timeit ts.value_counts(dropna=False)
10 loops, best of 3: 25.7 ms per loop

In [11]: %timeit Series(np.arange(len(ts.cat.categories)),ts.cat.categories).map(ts.cat.codes.value_counts()).order(ascending=False) 
10 loops, best of 3: 29.5 ms per loop
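
One plausible way to support both `dropna` settings with `bincount` is to filter out the -1 codes when nulls are dropped, and otherwise remap them to an extra bin at the end. This is only a sketch under that assumption; the helper name `bincount_value_counts_with_na` is hypothetical and the branch may handle nulls differently:

```python
import numpy as np
import pandas as pd


def bincount_value_counts_with_na(cat_series, dropna=True):
    """Per-category counts from codes, optionally counting nulls (hypothetical helper)."""
    codes = cat_series.cat.codes.to_numpy().astype(np.intp)
    categories = cat_series.cat.categories
    ncat = len(categories)
    mask = codes >= 0  # False where the value is missing (code == -1)
    if dropna or mask.all():
        # Either nulls are unwanted or there are none: count valid codes only.
        counts = np.bincount(codes[mask], minlength=ncat)
        index = categories
    else:
        # Send missing values to an extra bin placed after the last category.
        counts = np.bincount(np.where(mask, codes, ncat), minlength=ncat + 1)
        index = list(categories) + [np.nan]
    return pd.Series(counts, index=index)


s = pd.Series(["a", "b", "a", None, "c", "a"]).astype("category")
with_na = bincount_value_counts_with_na(s, dropna=False)
without_na = bincount_value_counts_with_na(s, dropna=True)
```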

jreback · 2015-08-21T13:19:02Z

wow, this does even better!

In [7]: %timeit ts.value_counts(dropna=True)
100 loops, best of 3: 11 ms per loop

In [8]: %timeit ts.value_counts(dropna=False)
100 loops, best of 3: 9.53 ms per loop

In [9]: %timeit Series(np.arange(len(ts.cat.categories)),ts.cat.categories).map(ts.cat.codes.value_counts()).order(ascending=False) 
100 loops, best of 3: 17.4 ms per loop

jreback · 2015-08-21T13:20:25Z

ping when green

jorisvandenbossche · 2015-08-21T14:36:00Z

Maybe worth adding a benchmark?

behzadnouri · 2015-08-21T18:04:30Z

I will add a benchmark later today.

behzadnouri · 2015-08-22T11:20:37Z

added the benchmark, all green.

jreback · 2015-08-22T15:41:52Z

asv_bench/benchmarks/categoricals.py

+    def time_value_counts(self):
+        self.ts.value_counts(dropna=True)
+        self.ts.value_counts(dropna=False)


These should have only 1 action per timing function (so make 2 functions).

behzadnouri · 2015-08-22

Why should it be only 1 action?

jorisvandenbossche · 2015-08-22

You get a timing per function. So if you want to track performance of both with dropna True and False, it has to be in two functions.

behzadnouri · 2015-08-22

Added separate calls.
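
A sketch of what the split benchmark could look like, with one action per `time_*` method so that asv reports a separate timing for each `dropna` setting. The class name, the second method name, and the `setup` body (taken from the snippet in the first comment) are assumptions, not the code that was merged:

```python
import numpy as np
import pandas as pd


class CategoricalValueCounts:
    """Hypothetical asv benchmark class; asv times each time_* method separately."""

    def setup(self):
        # Same data as the timing example at the top of this thread.
        np.random.seed(2718281)
        n = 500000
        u = int(0.1 * n)
        arr = ["s%04d" % i for i in np.random.randint(0, u, size=n)]
        self.ts = pd.Series(arr).astype("category")

    def time_value_counts(self):
        self.ts.value_counts(dropna=True)

    def time_value_counts_dropna(self):
        self.ts.value_counts(dropna=False)
```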

jreback · 2015-08-22T20:03:06Z

thank you sir!

behzadnouri force-pushed the cat-val-cnt branch from 24f1e3f to 436e96e Compare August 21, 2015 11:47

jreback added the Performance (Memory or execution speed performance) and Categorical (Categorical Data Type) labels Aug 21, 2015

jreback added this to the 0.17.0 milestone Aug 21, 2015

behzadnouri force-pushed the cat-val-cnt branch from 436e96e to 855b804 Compare August 21, 2015 22:52

jreback reviewed Aug 22, 2015

PERF: uses bincount instead of hash table in categorical value counts

c5a47e3

behzadnouri force-pushed the cat-val-cnt branch from 855b804 to c5a47e3 Compare August 22, 2015 16:15

jreback added a commit that referenced this pull request Aug 22, 2015

Merge pull request #10874 from behzadnouri/cat-val-cnt

1cf18cd

PERF: uses bincount instead of hash table in categorical value counts

jreback merged commit 1cf18cd into pandas-dev:master Aug 22, 2015

behzadnouri deleted the cat-val-cnt branch August 22, 2015 20:28
