Skip to content

PERF: Improve SeriesGroupBy.value_counts performances with categorical values. #46940

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

LucasG0
Copy link
Contributor

@LucasG0 LucasG0 commented May 4, 2022

The performance issue comes from the fact the categorical path relies on GroupBy.apply. Actually, SeriesGroupby.value_counts and DataFrameGroupBy.value_counts have different implementations, and I feel they should both rely on a single one. Typically, the Series implementation looks more complicated than the DataFrame one (which relies on GroupBy.size), this is why this change creates a DataFrameGroupBy object to rely on the DataFrame implementation. Ideally, I think the entire Series implementation could rely on the DataFrame one (not only the categorical path). Also, I have a small doubt whether creating a new DataFrameGroupBy object out of a SeriesGroupBy one is a good practice or not, so if there is a better way to do here feel free to suggest it.

Note I did not add a test in this PR as it fixes performance only, but I manually tested and confirmed the performance improvement.

@LucasG0 LucasG0 force-pushed the series_groupby_value_counts_categorical_performance branch from 43d8e34 to 6aa38db Compare May 4, 2022 20:44
@jreback
Copy link
Contributor

jreback commented May 4, 2022

do we have asv's for this? can you show. (if not, or they are not representative of the OP can you add)

@jreback jreback added this to the 1.5 milestone May 4, 2022
@jreback jreback added Groupby Performance Memory or execution speed performance labels May 4, 2022
@LucasG0
Copy link
Contributor Author

LucasG0 commented May 6, 2022

do we have asv's for this? can you show. (if not, or they are not representative of the OP can you add)

It seems there is no asv for this. I will add one once I succeed to run the benchmark suite.

@LucasG0
Copy link
Contributor Author

LucasG0 commented May 14, 2022

After benchmarking, I noticed that the current DataFrameGroupBy.value_counts implementation might be slower then the SeriesGroupBy.value_counts with a high number of categories. Here are the results I got for the two benchmarks I added (cf this commit).

benchmarks

It shows a x30 performance boost for a small number of categories, and a x2 slow down with many categories.
I also noticed a 10x boost on OP's example (with a df size of 10_000 instead of 13_000_000).

I feel this fix is still interesting considering the slow down for many categories is negligible regarding the boost introduced for a small number of categories. I guess it depends on how many categories people use in general. @jreback do you have an opinion on it ?

Also, I profiled the DataFrameGroupBy.value_counts implementation to see where the time is spent, and it is mainly during sorting and _reindex_output.

snakeviz

if is_categorical_dtype(val.dtype) or (
bins is not None and not np.iterable(bins)
):
if is_categorical_dtype(val.dtype):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we just call the appropriate implementation. creating a new grouper and calling like this just obfuscatest he code. would really prefer simply to have a single implementation and just call it.

Copy link
Contributor Author

@LucasG0 LucasG0 May 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand why it obfuscates the code, and I wonder what would be the approriate implementation here. Do you mean introducing a method _groupby_value_counts containing most of the current DataFrame implementation logic, that would be called by both Series and DataFrame implementation ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback could you please elaborate what you think would be the approriate implementation to call here ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to refactor the value counts to a separate function then just call it for frame and series

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After investigation, I think the common implementation should rely on a groupby.size (ie similar to the current DataFrame implementation) instead of the logic of the current Series implementation which looks more cumbersome. However, the DataFrame implementation relying on groupby.size holds a bug when grouping on a grouper with a freq, cf #47286. Therefore, having such a common implementation to Series/DataFrame would break test_series_groupby_value_counts_with_grouper.

@github-actions
Copy link
Contributor

github-actions bot commented Jul 9, 2022

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Jul 9, 2022
@LucasG0
Copy link
Contributor Author

LucasG0 commented Jul 15, 2022

I am no longer working on this pull request, I unassigned me of #46202.
I think the follow up on this issue would be to have a common implementation for both SeriesGroupBy and DataFrameGroupBy, but this refactoring seems to be not trivial, cf this comment.

@mroeschke
Copy link
Member

Thanks for your efforts so far @LucasG0. I'm going to close this PR as it has gone stale, but can reopen if anyone wants to tackle in the future

@mroeschke mroeschke closed this Aug 1, 2022
@mroeschke mroeschke removed this from the 1.5 milestone Aug 1, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Groupby Performance Memory or execution speed performance Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: very slow groupby(col1)[col2].value_counts() for columns of type 'category'
3 participants