PERF: uses bincount instead of hash table in categorical value counts #10874

Merged
jreback merged 1 commit into pandas-dev:master from the cat-val-cnt branch on Aug 22, 2015

Conversation

behzadnouri
Contributor

closes #10804

In [1]: np.random.seed(2718281)

In [2]: n = 500000

In [3]: u = int(0.1*n)

In [4]: arr = ["s%04d" % i for i in np.random.randint(0, u, size=n)]

In [5]: ts = pd.Series(arr).astype('category')

In [6]: %timeit ts.value_counts()
10 loops, best of 3: 82.7 ms per loop

on branch:

In [6]: %timeit ts.value_counts()
10 loops, best of 3: 31.3 ms per loop
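
For readers skimming the thread, here is a minimal sketch of the idea in the title. It is not the code merged in this PR, and the variable names are illustrative. A categorical already stores each value as a small non-negative integer code, so the counts can come from one `np.bincount` pass over the codes rather than from hashing each value. The sketch assumes there are no missing values.

```python
import numpy as np
import pandas as pd

values = pd.Categorical(["b", "a", "b", "c", "b", "a"])

# Each element is stored as the integer position of its value in values.categories.
codes = np.asarray(values.codes)

# One pass over the integer codes; minlength keeps unobserved categories at count 0.
counts = np.bincount(codes, minlength=len(values.categories))

# Pair each category label with its count (illustrative; pandas returns a Series).
by_category = dict(zip(values.categories, counts.tolist()))
```

The result is ordered by category position, so a sort by count is still needed to match what `value_counts` returns.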

@jreback
Contributor

jreback commented Aug 21, 2015

I think the solution in the issue is faster than this, no?

@behzadnouri
Contributor Author

I get

In [9]: %timeit Series(np.arange(len(ts.cat.categories)),ts.cat.categories).map(ts.cat.codes.value_counts()).order(ascending=False)
10 loops, best of 3: 28.3 ms per loop

but this does not check for nulls, and the index is not categorical.
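
A rough sketch of what those two extra requirements could look like, assuming missing values are stored as code -1 (as pandas does for categoricals). `counts_with_nulls` is a hypothetical helper written for illustration, not the function in this branch.

```python
import numpy as np
import pandas as pd

def counts_with_nulls(cat, dropna=True):
    """Hypothetical helper: bincount-based counts that handle missing values."""
    codes = np.asarray(cat.codes, dtype=np.intp)
    ncat = len(cat.categories)
    missing = codes < 0  # pandas stores missing values as code -1

    if dropna or not missing.any():
        # Drop the -1 codes, since bincount rejects negative input.
        counts = np.bincount(codes[~missing], minlength=ncat)
        labels = list(cat.categories)
    else:
        # Route missing values into one extra bin at position ncat.
        counts = np.bincount(np.where(missing, ncat, codes), minlength=ncat + 1)
        labels = list(cat.categories) + [np.nan]

    # Keep the index categorical, with the same categories as the input.
    index = pd.CategoricalIndex(labels, categories=cat.categories)
    return pd.Series(counts, index=index)

cat = pd.Categorical(["x", None, "y", "x", None], categories=["x", "y", "z"])
dropped = counts_with_nulls(cat, dropna=True)
kept = counts_with_nulls(cat, dropna=False)
```

The sketch leaves out sorting by count, which `value_counts` applies on top of this.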

@behzadnouri
Contributor Author

with dropna=False the branch performs better:

In [9]: %timeit ts.value_counts(dropna=True)
10 loops, best of 3: 32.3 ms per loop

In [10]: %timeit ts.value_counts(dropna=False)
10 loops, best of 3: 25.7 ms per loop

In [11]: %timeit Series(np.arange(len(ts.cat.categories)),ts.cat.categories).map(ts.cat.codes.value_counts()).order(ascending=False) 
10 loops, best of 3: 29.5 ms per loop

@jreback
Contributor

jreback commented Aug 21, 2015

wow, this does even better!

In [7]: %timeit ts.value_counts(dropna=True)
100 loops, best of 3: 11 ms per loop

In [8]: %timeit ts.value_counts(dropna=False)
100 loops, best of 3: 9.53 ms per loop

In [9]: %timeit Series(np.arange(len(ts.cat.categories)),ts.cat.categories).map(ts.cat.codes.value_counts()).order(ascending=False) 
100 loops, best of 3: 17.4 ms per loop

@jreback jreback added Performance Memory or execution speed performance Categorical Categorical Data Type labels Aug 21, 2015
@jreback jreback added this to the 0.17.0 milestone Aug 21, 2015
@jreback
Contributor

jreback commented Aug 21, 2015

ping when green

@jorisvandenbossche
Member

Maybe worth adding a benchmark?

@behzadnouri
Contributor Author

I will add benchmark later today

@behzadnouri
Contributor Author

added the benchmark, all green.

    def time_value_counts(self):
        self.ts.value_counts(dropna=True)
        self.ts.value_counts(dropna=False)
Contributor

These should have only 1 action per timing function (so make 2 functions)

Contributor Author

why should it be only 1 action?

Member

You get a timing per function. So if you want to track performance of both with dropna True and False, it has to be in two functions.

Contributor Author

added separate calls
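
For reference, a split along the lines requested in the review might look roughly like the sketch below. The class name and the second method name are hypothetical and may not match what was committed. The setup reuses the data from the PR description. Each `time_*` method performs exactly one call, so the benchmark runner reports a separate timing for each `dropna` setting.

```python
import numpy as np
import pandas as pd

class CategoricalValueCounts:
    """Hypothetical asv-style benchmark class; names are illustrative."""

    def setup(self):
        # Same data as in the PR description.
        np.random.seed(2718281)
        n = 500000
        u = int(0.1 * n)
        arr = ["s%04d" % i for i in np.random.randint(0, u, size=n)]
        self.ts = pd.Series(arr).astype("category")

    def time_value_counts(self):
        # One action per timing function: the dropna=True path only.
        self.ts.value_counts(dropna=True)

    def time_value_counts_dropna(self):
        # Timed separately so a regression in either path is visible on its own.
        self.ts.value_counts(dropna=False)
```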

jreback added a commit that referenced this pull request Aug 22, 2015
PERF: uses bincount instead of hash table in categorical value counts
@jreback jreback merged commit 1cf18cd into pandas-dev:master Aug 22, 2015
@jreback
Contributor

jreback commented Aug 22, 2015

thank you sir!

@behzadnouri behzadnouri deleted the cat-val-cnt branch August 22, 2015 20:28
Labels
Categorical Categorical Data Type Performance Memory or execution speed performance
Development

Successfully merging this pull request may close these issues.

PERF: categorical value_counts can be much faster
3 participants