-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
PERF: fix #32976 slow group by for categorical columns #33739
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
pandas/core/groupby/generic.py
Outdated
result, _ = self.grouper.aggregate( | ||
block.values, how, axis=1, min_count=min_count | ||
|
||
cat_method_blacklist = ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I copied this list from asv_bench.benchmarks.groupby.method_blacklist['object']
and appended the "add" method. Is there a better way of blacklisting these methods which shouldn't be applied to categorical codes?
tests.groupby.test_function.test_arg_passthru
is an example of a test failure without this blacklisting.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jbrockmendel how close do you think we are to defining an API to allow passing EA values to cython arrays?
# ignore the method names.
values = self._values_for_cython(method="first")
# ... operation on values
result = self._result_from_cython(..., dtype=self.dtype method="first")
For categorical, that method returns codes
and result_from_cython
is from_codes
.
pandas/core/groupby/generic.py
Outdated
categories=block.values.categories, | ||
ordered=block.values.ordered, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
categories=block.values.categories, | |
ordered=block.values.ordered, | |
) | |
dtype=block.values.dtype | |
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually tried this and found that it raises a ValueError here: Cannot specify categories
or ordered
together with dtype
If we restrict attention to ordering-based cython methods then i think this is pretty straightforward, just need to get everyone on board. |
696508a
to
0340554
Compare
pandas/core/groupby/ops.py
Outdated
@@ -472,6 +473,29 @@ def _cython_operation( | |||
|
|||
is_datetimelike = needs_i8_conversion(values.dtype) | |||
is_numeric = is_numeric_dtype(values.dtype) | |||
is_categorical = is_categorical_dtype(values) | |||
cat_method_blacklist = ( | |||
"add", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do we need a blacklist at all? you are already only operating on the codes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this method is passed categorical values with a "how" like mean then we don't want to average the codes as if they were ints -- this would happen in the pandas.tests.groupby.test_function.test_arg_passthru
test without the blacklist, for example. the logic for cases these blacklist cases is already handled in higher in the call stack in pandas.core.groupby.generic.DataFrameGroupBy._cython_agg_blocks
but maybe we could move some of that logic here for conciseness and/or perf later on
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i don't think you need this at all, what breaks if you take it out entirely (the blackist)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test will fail without the blacklist: pandas.tests.groupby.test_function.test_arg_passthru
. Here's a minimal example that will fail:
import pandas as pd
df = pd.DataFrame(
{"group": [1, 1, 2], "category_string": pd.Series(list("abc")).astype("category")},
columns=["group", "category_string"],
)
df.groupby("group").mean(numeric_only=False)
Nothing else fails so perhaps this test can be updated in lieu of the blacklist, not sure if this is fine from an interface perspective.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah i think its ok to simply update this test
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It really seems like we need this blacklist anytime the cython op does something with the actual values then we can't pass the codes there. For example something like mean
or sum
on a numeric categorical.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. ok then, let's move this to an attribute on the class, or a function maybe better for easier re-sue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved to class level in 79f0c72. We could easily make this accessible outside the class with an additional method w/ property decorator
@@ -466,7 +466,7 @@ def test_agg_cython_category_not_implemented_fallback(): | |||
result = df.groupby("col_num").col_cat.first() | |||
expected = pd.Series( | |||
[1, 2, 3], index=pd.Index([1, 2, 3], name="col_num"), name="col_cat" | |||
) | |||
).astype("category") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this changed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The _cython_operation
method in this commit returns a categorical if passed a categorical. In this test the "col_cat" column is categorical so it seems intuitive to me that when aggregated over it should also be a category.
The fact that this test was passing before suggests that this is not also true in the higher order functions like DataFrameGroupBy._cython_agg_blocks
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change needs a dedicated whatsnew entry.
Agreed that preserving the dtype is probably the right choice, but this is a relatively large change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, just added a new entry in the "Other API changes" section in commit 6498d6b
pls show the performance before / after. |
asv_bench/benchmarks/groupby.py
Outdated
) | ||
self.df_cat_values = df_int.astype({"cat": CAT}) | ||
|
||
def time_groupby(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
likely need to paramaterize this over a bunch of functions. assume this is not too slow to do that. if so pls reduce the size of the benchmark to make it reasonable to parameterize.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this takes 10.2 ms ± 28.7 µs per loop on the current commit on my desktop via timeit but 7.14 s ± 31.3 ms per loop on master. I think 10ms should be ok but please let me know otherwise
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes this is fine, but i need you to paramaterize over multiple functions, doesn't have to be every one but represententative ones (e.g. reductions, transforms and filters)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
paramaterized over column types in b4648d5
latest asv results for new benchmark replicating #32976:
|
asv_bench/benchmarks/groupby.py
Outdated
) | ||
self.df_cat_values = df_int.astype({"cat": CAT}) | ||
|
||
def time_groupby(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes this is fine, but i need you to paramaterize over multiple functions, doesn't have to be every one but represententative ones (e.g. reductions, transforms and filters)
pandas/core/groupby/ops.py
Outdated
@@ -472,6 +473,29 @@ def _cython_operation( | |||
|
|||
is_datetimelike = needs_i8_conversion(values.dtype) | |||
is_numeric = is_numeric_dtype(values.dtype) | |||
is_categorical = is_categorical_dtype(values) | |||
cat_method_blacklist = ( | |||
"add", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i don't think you need this at all, what breaks if you take it out entirely (the blackist)
4398706
to
b4648d5
Compare
asv_bench/benchmarks/groupby.py
Outdated
@@ -510,6 +512,33 @@ def time_groupby_extra_cat_nosort(self): | |||
self.df_extra_cat.groupby("a", sort=False)["b"].count() | |||
|
|||
|
|||
class CategoricalFrame: | |||
# benchmark grouping with operations on categorical values (GH #32976) | |||
param_names = ["groupby_type", "value_type"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oarameterize over
mean and head as well
i would also reduce the number of groups as well
it will still show an appreciable diff
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done. but I swapped mean for count since mean requires numeric types
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i also removed the str groupby parameter option to limit the number of tests but the benchmark results are the same either way for this diff.
here are the latest results:
before after ratio
[22cf0f5d] [a16f6a23]
<master>
- 167±1ms 2.72±0.09ms 0.02 groupby.CategoricalFrame.time_groupby(<class 'int'>, <class 'str'>, 'last')
- 168±2ms 2.66±0.07ms 0.02 groupby.CategoricalFrame.time_groupby_ordered(<class 'int'>, <class 'str'>, 'last')
- 174±2ms 2.65±0.09ms 0.02 groupby.CategoricalFrame.time_groupby_ordered(<class 'int'>, <class 'int'>, 'last')
- 175±2ms 2.61±0.05ms 0.01 groupby.CategoricalFrame.time_groupby(<class 'int'>, <class 'int'>, 'last')
SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
6498d6b
to
9377e8d
Compare
pandas/core/groupby/ops.py
Outdated
@@ -472,6 +473,29 @@ def _cython_operation( | |||
|
|||
is_datetimelike = needs_i8_conversion(values.dtype) | |||
is_numeric = is_numeric_dtype(values.dtype) | |||
is_categorical = is_categorical_dtype(values) | |||
cat_method_blacklist = ( | |||
"add", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah i think its ok to simply update this test
@rtlee9 Can you merge master and fix the conflict? |
Fixed conflict and merged in 7570425 |
doc/source/whatsnew/v1.1.0.rst
Outdated
@@ -239,7 +239,7 @@ Other API changes | |||
- ``loc`` lookups with an object-dtype :class:`Index` and an integer key will now raise ``KeyError`` instead of ``TypeError`` when key is missing (:issue:`31905`) | |||
- Using a :func:`pandas.api.indexers.BaseIndexer` with ``count``, ``min``, ``max``, ``median``, ``skew``, ``cov``, ``corr`` will now return correct results for any monotonic :func:`pandas.api.indexers.BaseIndexer` descendant (:issue:`32865`) | |||
- Added a :func:`pandas.api.indexers.FixedForwardWindowIndexer` class to support forward-looking windows during ``rolling`` operations. | |||
- | |||
- :meth:`DataFrame.groupby` aggregations of categorical series will now return a :class:`Categorical` while preserving the codes and categories of the original series |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this some other issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a (desirable) side effect of this PR - please find more context in this thread.
@@ -494,6 +519,17 @@ def _cython_operation( | |||
values = ensure_int_or_float(values) | |||
elif is_numeric and not is_complex_dtype(values): | |||
values = ensure_float64(values) | |||
elif is_categorical: | |||
if how in self._cat_method_blacklist: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I really don't like doing this. Can you elaborate when we can actually process this? listing methods is a bad idea generally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not exactly sure what you mean by "when we can actually process this" this but I agree that listing methods isn't necessarily thorough and isn't robust. However, I've been unable to find a suitable alternative to blacklisting methods where we don't want to apply the aggregation on the category codes -- open to other ideas though. Please find more context in this thread.
@rtlee9 if you'd address the comments and merge master |
acc330a
to
46e1e8e
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if u can merge master
i don't really like the specific method blacklist - if u can find a better way
@rtlee9 can you move release note to 1.2 and merge upstream/master to resolve conflict |
964ac84
to
725c6d2
Compare
doc/source/whatsnew/v1.2.0.rst
Outdated
@@ -132,6 +133,7 @@ Plotting | |||
Groupby/resample/rolling | |||
^^^^^^^^^^^^^^^^^^^^^^^^ | |||
|
|||
- :meth:`DataFrame.groupby` aggregations of categorical series will now return a :class:`Categorical` while preserving the codes and categories of the original series |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add this PR as the issue (or relevant issue if discussed elsewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added the PR as the issue in commit 0fd3dde
Aggregate categorical codes with fast cython aggregation for select `how` operations. 8/1/20: rebase and move release note to 1.2 8/2/20: Update tests to expect categorical back 8/3/20: add PR as issue for whatsnew groupby api change
@rtlee9 Is this still active? Some merge conflicts that need resolving. |
@@ -356,6 +357,29 @@ def get_group_levels(self) -> List[Index]: | |||
|
|||
_name_functions = {"ohlc": ["open", "high", "low", "close"]} | |||
|
|||
_cat_method_blacklist = ( | |||
"add", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what methods don't work?
Closing for now. @rtlee9 ping us whenever you'd like to continue and we'll reopen! |
Aggregate categorical codes with fast cython aggregation for select
how
operations. Added new ASV benchmark copied from 32976 indicating > 99% improvement in performance for this case.black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff