-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
REGR: cumsum regression with groupby call to agg #31802
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
@mattharrison Thanks for the clear report! I can confirm this is a regression. cc @jbrockmendel this is another example of something failing in libreduction with another error than the specific one that is catched: pandas/pandas/core/groupby/ops.py Lines 640 to 648 in ca84bd0
|
@jbrockmendel can you take a look? |
We expect cumsum().max() to be a reduction, but when we go through libreduction, on the last group it doesn't reduce. This suggests that we could catch this within lib-reduction and re-raise as |
Can you explain this a bit more? The function is doing a |
I, too, expect it to be a reduction. I added in Block.apply a
so maybe it is reducing, just along the wrong axis? |
A bit simpler reproducer: In [43]: df = pd.DataFrame({"A": ['a', 'a', 'a'], "B": ['a', 'b', 'b'], "C": [1, 1, 1]})
In [44]: def f(x):
...: assert len(x) == len(x._data.blocks[0].mgr_locs)
...: return 0
...:
In [45]: df.groupby(["A", "B"]).agg(f)
---------------------------------------------------------------------------
AssertionError Traceback (most recent call last)
<ipython-input-45-6faf28de6e4f> in <module>
----> 1 df.groupby(["A", "B"]).agg(f)
~/sandbox/pandas/pandas/core/groupby/generic.py in aggregate(self, func, *args, **kwargs)
937 # grouper specific aggregations
938 if self.grouper.nkeys > 1:
--> 939 return self._python_agg_general(func, *args, **kwargs)
940 elif args or kwargs:
941 result = self._aggregate_frame(func, *args, **kwargs)
~/sandbox/pandas/pandas/core/groupby/groupby.py in _python_agg_general(self, func, *args, **kwargs)
924 try:
925 # if this function is invalid for this dtype, we will ignore it.
--> 926 result, counts = self.grouper.agg_series(obj, f)
927 except TypeError:
928 continue
~/sandbox/pandas/pandas/core/groupby/ops.py in agg_series(self, obj, func)
638
639 try:
--> 640 return self._aggregate_series_fast(obj, func)
641 except ValueError as err:
642 if "Function does not reduce" in str(err):
~/sandbox/pandas/pandas/core/groupby/ops.py in _aggregate_series_fast(self, obj, func)
663 group_index = algorithms.take_nd(group_index, indexer, allow_fill=False)
664 grouper = libreduction.SeriesGrouper(obj, func, group_index, ngroups, dummy)
--> 665 result, counts = grouper.get_result()
666 return result, counts
667
~/sandbox/pandas/pandas/_libs/reduction.pyx in pandas._libs.reduction.SeriesGrouper.get_result()
373 cached_typ, cached_ityp, islider, vslider)
374
--> 375 res, initialized = self._apply_to_group(cached_typ, cached_ityp,
376 islider, vslider,
377 group_size, initialized)
~/sandbox/pandas/pandas/_libs/reduction.pyx in pandas._libs.reduction._BaseGrouper._apply_to_group()
191 """
192 cached_ityp._engine.clear_mapping()
--> 193 res = self.f(cached_typ)
194 res = _extract_result(res)
195 if not initialized:
~/sandbox/pandas/pandas/core/groupby/groupby.py in <lambda>(x)
911 def _python_agg_general(self, func, *args, **kwargs):
912 func = self._is_builtin_func(func)
--> 913 f = lambda x: func(x, *args, **kwargs)
914
915 # iterate through "columns" ex exclusions to populate output dict
<ipython-input-44-5151b7117ae2> in f(x)
1 def f(x):
----> 2 assert len(x) == len(x._data.blocks[0].mgr_locs)
3 return 0
4
AssertionError: I think the issue is roughly somewhere around pandas/pandas/_libs/reduction.pyx Line 179 in 54c5e9e
I'm not 100% sure what the solution is. It seems weird to mutate the mgr_locs in libreduction. Do you have any thoughts on that @jbrockmendel? |
That can only end in tears. I'd be inclined to failover to the non-fastpath pretty aggressively. |
Yeah, that's my preference as well. By "weird" I should have probably put "crazy" :) I'll see if I can get a somewhat targeted check for when we need to bail out. |
Right now I'm pessimistic about a "targeted check" being possible. It seems like I think our only two choices are to
|
I suppose the 3rd choice is to continue to play whack-a-mole on exceptions / messages in pandas/pandas/core/groupby/ops.py Lines 639 to 647 in 787dc8a
|
Closes pandas-dev#31802 This "fixes" pandas-dev#31802 by expanding the number of cases where we swallow an exception in libreduction. Currently, we're creating an invalid Series in SeriesBinGrouper where the `.mgr_locs` doesn't match the values. See pandas-dev#31802 (comment) for more. For now, we simply catch more cases that fall back to Python. I've gone with a minimal change which addresses only issues hitting this exact exception. We might want to go broader, but that's not clear.
Maintenance-wise, this is by far the most appealing option, xref #32086. As of that PR, the performance penalty (as measured by our most-affected asv) of disabling fast_apply was 48x. Some perf PRs since then have gotten that down to 7x. I'm pretty sure I can shave that down to 5x without too much trouble. Leaving aside the issue of one asv not being a representative sample, the question is how big a perf penalty are we willing to accept to get rid of this hassle? |
I'm a little confused here. How are we getting to a point where vslider.buf has the wrong shape? |
What's the difference maintenance wise between try/except the fast version (with a generic |
|
we could use something like the |
But we need the cython code anyway? Or would we be able to remove quite some of it if it didn't need to support UDFs? |
If we got rid of fast_apply, we could rip out 136 lines of cython and about 60 more of non-cython, see #32086. Also it looks like fast/slow paths give different behavior for tshift in tests.groupby.test_whitelist |
@TomAugspurger after having looked at it more closely, updating mgr_locs is viable. Adding |
Hmmm OK. Let's dig this hole a little deeper then... |
Any thoughts here? I've got it down to 4.5x locally |
4.5x seems a bit rough (is that across the board, or just specific cases?) I think I'd prefer to see something like what you proposed earlier:
It seems safeish to be doing these hacks on plain ndarrays. I think most of the fragility comes from trying to operate on more complex objects like Series & DataFrame. If we could ensure that the callable is only operating on ndarrays (via an opt-in keyword like |
This seems like a good tradeoff.
A few weeks ago I ran the asvs with fast_apply disabled and chose the one that was most affected. Have been using that for comparison in optimization PRs since then. Most of the slowdown comes from slicing because we create new Series+SingleBlockManager+Index+Block objects on each of ~2k groups |
Do you have a sense (without looking) for whether we're doing "pandas things" (accessing the index, blocks, etc.) in those methods? I guess that determines whether the |
which methods are you referring to? |
The callable being applied. If I'm thinking about this the right way, that callable would start receiving an ndarray rather than a Series. |
Gotcha. Yah, passing an ndarray to the UDF would allow for a lot of simplication |
Code Sample, a copy-pastable example if possible
I want to define a custom function that I can pass to the
agg
method. It uses thecumsum
method, which appears to be problematic recently.Problem description
Prior to Pandas 1.0rc this worked. It now raises an exception:
Expected Output
Output of
pd.show_versions()
The text was updated successfully, but these errors were encountered: