ENH: option to force slow code path (don't call apply function 1 too many times) in GroupBy.apply #2936
Comments
I believe this is a duplicate of #2656 and not a bug... The function is called twice for internal implementation reasons, but the results are correct. Possibly the docs should be updated to indicate that func passed to |
I see. I agree with the suggestion on the other thread that there should be a way to tell apply that there are no side effects, so that it does not have to run the first group twice. At the very least, it shouldn't run the first group twice if there is only one group. |
hmm...well, what do you actually need guaranteed and why? just that the function is only called once per group? (or do you need a specific ordering, too?) there's no way to add an option "i want to take the fast path always" because the fast path makes certain assumptions about memory aliasing and such and can cause a segfault if they're incorrect. plus, the fast path implementation could change or an even faster path could be implemented, all of which are internal implementation details. so it only makes sense to allow an option to force it to take the slow path, and the only use case for that would be if you depended on the order or number of calls (which would only matter if your function had side-effects, not vice-versa, unless your apply function is very expensive and you're worried about the CPU cycles...). is there a specific reason you need that? |
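To check how many times a function is actually invoked on a given pandas version, a small counting wrapper is enough. This is only a sketch: `count_calls` and `summarize` are hypothetical names made up for illustration, and the number of calls you observe depends on the installed pandas version, so no expected count is shown.

```python
import pandas as pd


def count_calls(func):
    """Wrap func so that every invocation increments a counter."""
    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return func(*args, **kwargs)
    wrapper.calls = 0
    return wrapper


@count_calls
def summarize(group):
    # Stand-in for an expensive per-group computation (hypothetical).
    return group["value"].sum()


df = pd.DataFrame({"key": ["x", "x", "y"], "value": [1, 2, 3]})

# Select the value column explicitly so the grouping column is not passed in.
result = df.groupby("key")[["value"]].apply(summarize)

n_groups = df["key"].nunique()
# If calls > groups, the function was invoked more than once for some group.
print(f"groups: {n_groups}, calls: {summarize.calls}")
```

If the printed call count exceeds the number of groups, the function is being evaluated more than once for at least one group on that version.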
The case I came across is an expensive apply function with few groups. My apply takes 30 seconds, so running the first group twice adds thirty seconds to the runtime. Now that I know of this double run, though, I have changed my code to do the split-apply-concat step manually. I think I will try to use a global variable to just skip through the first run next time. |
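The manual split-apply-concat workaround mentioned above could look roughly like the following. It is a sketch under assumptions, not the commenter's actual code: `apply_once_per_group` and `expensive` are hypothetical names, and the result is keyed by group label via `pd.concat`, which may differ in shape from what `GroupBy.apply` returns.

```python
import pandas as pd


def apply_once_per_group(df, by, func):
    """Call func exactly once per group and concatenate the results.

    Hypothetical helper: iterates over the groups directly instead of
    going through GroupBy.apply, so the call count is under our control.
    """
    pieces = {}
    for key, group in df.groupby(by):
        pieces[key] = func(group)
    # The dict keys become the outer level of the resulting index.
    return pd.concat(pieces)


calls = []


def expensive(group):
    # Record each invocation so the call count can be inspected afterwards.
    calls.append(group["key"].iloc[0])
    return group.assign(total=group["value"].sum())


df = pd.DataFrame({"key": ["x", "x", "y"], "value": [1, 2, 3]})
out = apply_once_per_group(df, "key", expensive)

print(calls)
print(out)
```

Because the loop calls `func` directly, each group is processed once regardless of which internal path `GroupBy.apply` would have taken.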
ok, if your apply is that expensive that makes sense then. (so maybe it should be an option to force the basic path) |
Marked as enhancement for someday. Just have to get a groupby parameter to flow through |
Why doesn't the first iteration run twice if I use a lambda expression?
>>> df = pd.DataFrame({"a":["x", "y"], "b":[1,2]})
>>> identity = lambda row: print(tuple(row))
>>> df2 = df.apply(identity, axis=1)
('x', 1)
('y', 2) |
I'd like to consider adopting dask's behavior here, where the user provides a
This has worked quite well for dask. |
i don’t think we need to do this at all |
I am also using groupby.apply() on a small number of groups, which can sometimes be just 1 depending on user inputs. In that particular case the code is 2x slower. An option to force the path would be much appreciated. |
applym is called twice on the first group.