PERF: Improve SeriesGroupBy.value_counts performances with categorical values. #46940

LucasG0 · 2022-05-04T13:44:04Z

closes BUG: very slow groupby(col1)[col2].value_counts() for columns of type 'category' #46202
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

The performance issue arises because the categorical path relies on GroupBy.apply. SeriesGroupBy.value_counts and DataFrameGroupBy.value_counts currently have different implementations, and I feel they should share a single one. The Series implementation looks more complicated than the DataFrame one (which relies on GroupBy.size), which is why this change creates a DataFrameGroupBy object and delegates to the DataFrame implementation. Ideally, I think the entire Series implementation could rely on the DataFrame one (not only the categorical path). I am also unsure whether creating a new DataFrameGroupBy object from a SeriesGroupBy one is good practice, so if there is a better way to do this, feel free to suggest it.
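To make the two code paths concrete, here is a minimal sketch that calls both public methods on the same categorical column and checks that they report the same counts. The data, column names, and seed are assumptions made for illustration; it does not reproduce the internal change in this PR, only the user-facing equivalence it relies on.

```python
import numpy as np
import pandas as pd

# Illustrative data (assumed): an integer grouping key and a categorical value column.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "key": rng.integers(0, 5, size=1_000),
        "val": pd.Categorical(rng.choice(["a", "b", "c"], size=1_000)),
    }
)

# Series path: the method this PR targets.
series_counts = df.groupby("key")["val"].value_counts()

# DataFrame path: select the same column as a one-column frame.
frame_counts = df.groupby("key")[["val"]].value_counts()

# Row order and zero-count entries may differ between pandas versions,
# so compare the non-zero counts as plain dictionaries.
series_dict = series_counts[series_counts > 0].to_dict()
frame_dict = frame_counts[frame_counts > 0].to_dict()
print(series_dict == frame_dict)
```

Comparing dictionaries sidesteps differences in result name and ordering, which are not the point of the comparison here.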

Note I did not add a test in this PR as it fixes performance only, but I manually tested and confirmed the performance improvement.

jreback · 2022-05-04T23:31:41Z

do we have asv's for this? can you show. (if not, or they are not representative of the OP can you add)

LucasG0 · 2022-05-06T15:37:48Z

> do we have asv's for this? can you show. (if not, or they are not representative of the OP can you add)

It seems there is no asv benchmark for this. I will add one once I manage to run the benchmark suite.
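For readers unfamiliar with asv, a benchmark for this case could look roughly like the sketch below. The class name, parameter values, and data sizes are hypothetical and are not taken from the commit that later added the benchmarks; the sketch only follows the usual asv convention of a `setup` method plus `time_*` methods parameterized through `params`.

```python
import numpy as np
import pandas as pd


class SeriesGroupByCategoricalValueCounts:
    # Hypothetical benchmark class: asv calls setup() and then each time_*
    # method once per entry in params.
    params = [10, 1_000]
    param_names = ["ncategories"]

    def setup(self, ncategories):
        n = 10_000
        rng = np.random.default_rng(0)
        self.df = pd.DataFrame(
            {
                "key": rng.integers(0, 100, size=n),
                # The number of distinct categories is the benchmarked variable.
                "val": pd.Categorical(rng.integers(0, ncategories, size=n)),
            }
        )

    def time_value_counts(self, ncategories):
        # asv measures the wall-clock time of this call.
        self.df.groupby("key")["val"].value_counts()
```

Parameterizing on the number of categories lets one benchmark cover both the few-categories and many-categories regimes.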

LucasG0 · 2022-05-14T13:18:56Z

After benchmarking, I noticed that the current DataFrameGroupBy.value_counts implementation can be slower than SeriesGroupBy.value_counts when there is a high number of categories. Here are the results I got for the two benchmarks I added (cf. this commit).

It shows a 30x speedup for a small number of categories and a 2x slowdown with many categories.
I also noticed a 10x speedup on the OP's example (with a df size of 10_000 instead of 13_000_000).

I feel this fix is still worthwhile, since the slowdown for many categories is negligible compared with the speedup for a small number of categories. I guess it depends on how many categories people typically use. @jreback do you have an opinion on this?

Also, I profiled the DataFrameGroupBy.value_counts implementation to see where the time is spent: it is mainly in sorting and in _reindex_output.
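A profile like the one described can be collected with the standard-library `cProfile` module. The following is a minimal sketch with assumed data sizes and column names; the functions that dominate the report will depend on the pandas version and on the data, so no particular output is implied.

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd

# Illustrative data (assumed): many categories, the regime reported as slower.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "key": rng.integers(0, 100, size=10_000),
        "val": pd.Categorical(rng.integers(0, 1_000, size=10_000)),
    }
)
grouped = df.groupby("key")[["val"]]

# Profile only the value_counts call.
profiler = cProfile.Profile()
profiler.enable()
grouped.value_counts()
profiler.disable()

# Print the 15 entries with the highest cumulative time.
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(15)
report = stream.getvalue()
print(report)
```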

jreback · 2022-05-15T03:03:36Z

pandas/core/groupby/generic.py

-        if is_categorical_dtype(val.dtype) or (
-            bins is not None and not np.iterable(bins)
-        ):
+        if is_categorical_dtype(val.dtype):


can we just call the appropriate implementation. creating a new grouper and calling like this just obfuscates the code. would really prefer simply to have a single implementation and just call it.

LucasG0 · May 16, 2022

I understand why it obfuscates the code, and I wonder what the appropriate implementation would be here. Do you mean introducing a method _groupby_value_counts containing most of the current DataFrame implementation logic, which would be called by both the Series and DataFrame implementations?

LucasG0 · May 22, 2022

@jreback could you please elaborate on what you think the appropriate implementation to call here would be?

jreback · May 22, 2022

need to refactor the value counts to a separate function then just call it for frame and series

LucasG0 · Jun 8, 2022

After investigation, I think the common implementation should rely on groupby.size (i.e. similar to the current DataFrame implementation) instead of the logic of the current Series implementation, which looks more cumbersome. However, the DataFrame implementation relying on groupby.size has a bug when grouping on a grouper with a freq, cf. #47286. Therefore, sharing such a common implementation between Series and DataFrame would break test_series_groupby_value_counts_with_grouper.
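The shape of the refactor discussed in this thread can be sketched with public API only: one size-based helper, called by a Series-style and a DataFrame-style wrapper. All three function names below are hypothetical, the helper uses `observed=True` (so unobserved categories are dropped) and does not reproduce value_counts' within-group sorting, and it does nothing about the freq-grouper bug in #47286. It is an illustration of the idea, not the pandas internals.

```python
import numpy as np
import pandas as pd


def _counts_via_size(frame, keys, columns):
    # Hypothetical shared helper: count each (keys..., columns...) combination
    # with a single groupby(...).size() call instead of GroupBy.apply.
    return frame.groupby([*keys, *columns], observed=True).size()


def series_value_counts(frame, keys, column):
    # Hypothetical Series-style entry point: a single value column.
    return _counts_via_size(frame, keys, [column])


def frame_value_counts(frame, keys, columns):
    # Hypothetical DataFrame-style entry point: one or more value columns.
    return _counts_via_size(frame, keys, columns)


rng = np.random.default_rng(0)
df = pd.DataFrame(
    {
        "key": rng.integers(0, 5, size=1_000),
        "val": pd.Categorical(rng.choice(["a", "b", "c"], size=1_000)),
    }
)

shared = series_value_counts(df, ["key"], "val")

# Reference: the existing public method, with zero-count rows dropped.
reference = df.groupby("key")["val"].value_counts()
reference = reference[reference > 0]
same_counts = shared.to_dict() == reference.to_dict()
print(same_counts)
```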

github-actions · 2022-07-09T00:05:03Z

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

LucasG0 · 2022-07-15T21:43:35Z

I am no longer working on this pull request, and I have unassigned myself from #46202.
I think the follow-up on this issue would be to have a common implementation for both SeriesGroupBy and DataFrameGroupBy, but this refactoring seems non-trivial, cf. this comment.

mroeschke · 2022-08-01T20:24:35Z

Thanks for your efforts so far @LucasG0. I'm going to close this PR as it has gone stale, but can reopen if anyone wants to tackle in the future

PERF: SeriesGroupBy.value_counts no longer relies on apply with categ…

6aa38db

…orical

LucasG0 force-pushed the series_groupby_value_counts_categorical_performance branch from 43d8e34 to 6aa38db Compare May 4, 2022 20:44

jreback added this to the 1.5 milestone May 4, 2022

jreback added the Groupby and Performance (Memory or execution speed performance) labels May 4, 2022

Add asv benchmarks

cd20580

jreback requested changes May 15, 2022

LucasG0 mentioned this pull request Jun 8, 2022

BUG: DataFrameGroupBy.value_counts when grouper has a frequency #47286

Closed

3 tasks

github-actions bot added the Stale label Jul 9, 2022

mroeschke closed this Aug 1, 2022

mroeschke removed this from the 1.5 milestone Aug 1, 2022
