Skip to content

PERF: Improve SeriesGroupBy.value_counts performances with categorical values. #46940

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
9 changes: 9 additions & 0 deletions asv_bench/benchmarks/groupby.py
Original file line number Diff line number Diff line change
Expand Up @@ -677,6 +677,15 @@ def time_groupby_extra_cat_sort(self):
def time_groupby_extra_cat_nosort(self):
self.df_extra_cat.groupby("a", sort=False)["b"].count()

def time_series_groupby_value_counts_many_categories(self):
df = self.df_extra_cat[0:10**4]
df.groupby("b")["a"].value_counts()

def time_series_groupby_value_counts_few_categories(self):
df = self.df_extra_cat[0:10**4].copy()
df["a"] = df["a"].cat.remove_unused_categories()
df.groupby("b")["a"].value_counts()


class Datelike:
# GH 14338
Expand Down
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.5.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -468,6 +468,7 @@ Performance improvements
- Performance improvement when setting values in a pyarrow backed string array (:issue:`46400`)
- Performance improvement in :func:`factorize` (:issue:`46109`)
- Performance improvement in :class:`DataFrame` and :class:`Series` constructors for extension dtype scalars (:issue:`45854`)
- Performance improvement in :meth:`SeriesGroupBy.value_counts` with categorical values. (:issue:`46202`)

.. ---------------------------------------------------------------------------
.. _whatsnew_150.bug_fixes:
Expand Down
31 changes: 27 additions & 4 deletions pandas/core/groupby/generic.py
Original file line number Diff line number Diff line change
Expand Up @@ -609,12 +609,35 @@ def value_counts(

names = self.grouper.names + [self.obj.name]

if is_categorical_dtype(val.dtype) or (
bins is not None and not np.iterable(bins)
):
if is_categorical_dtype(val.dtype):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we just call the appropriate implementation. creating a new grouper and calling like this just obfuscatest he code. would really prefer simply to have a single implementation and just call it.

Copy link
Contributor Author

@LucasG0 LucasG0 May 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand why it obfuscates the code, and I wonder what would be the approriate implementation here. Do you mean introducing a method _groupby_value_counts containing most of the current DataFrame implementation logic, that would be called by both Series and DataFrame implementation ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback could you please elaborate what you think would be the approriate implementation to call here ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to refactor the value counts to a separate function then just call it for frame and series

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After investigation, I think the common implementation should rely on a groupby.size (ie similar to the current DataFrame implementation) instead of the logic of the current Series implementation which looks more cumbersome. However, the DataFrame implementation relying on groupby.size holds a bug when grouping on a grouper with a freq, cf #47286. Therefore, having such a common implementation to Series/DataFrame would break test_series_groupby_value_counts_with_grouper.

df = self.obj.to_frame()
df.columns = Index([self.obj.name])
# GH38672 relates to categorical dtype
groupby = DataFrameGroupBy(
df,
self.grouper,
axis=self.axis,
level=self.level,
grouper=self.grouper,
exclusions=self.exclusions,
as_index=self.as_index,
sort=self.sort,
group_keys=self.group_keys,
squeeze=self.squeeze,
observed=self.observed,
mutated=self.mutated,
dropna=self.dropna,
)
ser = groupby.value_counts(
normalize=normalize, sort=sort, ascending=ascending
)
ser.name = self.obj.name
return ser

if bins is not None and not np.iterable(bins):
# scalar bins cannot be done at top level
# in a backward compatible way
# GH38672 relates to categorical dtype

ser = self.apply(
Series.value_counts,
normalize=normalize,
Expand Down