groupby - apply applies to the first group twice #7739
Comments
http://pandas.pydata.org/pandas-docs/stable/groupby.html#flexible-apply see the warning
See #6753 as well
My apologies, I guess I was looking at the dev version of the docs, which lacks the warning? I'm not really sure how I wound up there, though. Maybe it is worth documenting what influences the choice to take the "fast path" or the "slow path". That sounds important, from an optimization standpoint. If I have code that could work with the "fast path", I'd hate for it to decide to use the "slow path" just because I coded it one way when I could have coded it another.
You were looking at the 0.13.1 docs. Certainly would take a PR for an update. In a nutshell:
so it's not 100% obvious what happens just by examining the code; that's why a trial happens.
Can I provide hints, or even manually select the path I think applies (risking an exception or even crash if I choose wrong)? I would really like to avoid the extra invocation of my function, because it is expensive. At present, the best I can do seems to be to deliberately raise an exception on the first call (e.g. by asking for the group's To clarify, I'm performing a relatively expensive computation involving several other columns, and creating a few new columns which I add to the |
Don't use apply; instead, something like: concat([ func(c) for c, col in df.iteritems() ], axis=1) might work. Without showing what you are doing, it's just guessing.
No, that won't work at all. I will try to be clearer. I'm actually computing the most-likely state sequence from a hidden markov model, where each group is treated completely independently. Each row in the In another case, I need to do things like compute the group-wise mean of one column, and compare that to the value in another column. In both cases, Unless I've misunderstood something, the |
Code is worth 1000 words here.
OK, this is the one that isn't so expensive, but it doesn't require me to copy my numba implementation of the Viterbi algorithm, which is not really relevant.
Then, I have a DataFrame that has three new columns, You can see that I need all the columns at once, and I need to add a column, and I need to handle things on a group-by-group basis. Is this clearer? I suppose that, strictly speaking, I don't need to add a column. I just need a sequence of homogeneously typed data elements that has an exact item-to-row correspondence with the existing data frame. But, if I had that, then I could trivially add that as a column, so saying "I need to add a column" is just shorthand. I hope that's clear.
Your example is not reproducible, try something like this to generate your data. A copy-paste example is a must for a problem like this.
Most of the operations can simply be refactored out of the groupby, e.g. This is
you should simply use Try to do as much as possible outside the groupby, and then use the cythonized ops to make this faster |
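For the group-wise mean case mentioned earlier in the thread, one way to follow the "do it outside the groupby" advice is sketched below. It assumes `transform("mean")` fits the use case; the column names `g`, `x`, and `y` and the data are made up for illustration and are not from the original report.

```python
import pandas as pd

# Hypothetical data: 'g' is the group key, 'x' and 'y' are made-up columns.
df = pd.DataFrame({
    "g": ["a", "a", "b", "b"],
    "x": [1.0, 3.0, 10.0, 20.0],
    "y": [2.5, 1.0, 12.0, 30.0],
})

# Group-wise mean of 'x', broadcast back so there is one value per original row.
group_mean_x = df.groupby("g")["x"].transform("mean")

# The row-wise comparison then happens outside the groupby, with no user
# function being called per group.
df["y_above_group_mean_x"] = df["y"] > group_mean_x
```

Because no Python function is invoked per group here, the question of how many times it gets called does not arise for this particular computation.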
I think we're talking at cross purposes. I'm asking a general question (and making a general suggestion and/or feature request), not asking for help optimizing my code. You seemed to fundamentally misunderstand what I was talking about (the Here is some code which illustrates the general problem without providing specifics to get distracted by:
By "effectful", I mean something like disk writes, not modifications to the preexisting columns of To put it another way, the duplication of the first run is to allow And, lest I come off as too annoyed, demanding, or ungrateful, let me take the opportunity to thank all the pandas developers/contributors. It is really a wonderful tool, and I use it almost daily. |
Generally I see comments about wanting to do something when in fact the user is not approaching the problem in a pandonic / efficient manner. That's why seeing runnable code is useful. You can take my suggestions and use them (or not). As I said, you should do things outside of the groupby as much as possible. This may help as well: http://pandas.pydata.org/pandas-docs/stable/groupby.html#iterating-through-groups As far as making an option available to choose which path pandas takes: that is self-defeating and too complicated. It's an implementation detail. You can simply look at the code if you care, or use a groupby iteration if you want to completely avoid side effects (which it seems you do). No worries otherwise; just trying to make pandas generally useful with as consistent and simple an API as possible (which are of course frequently at cross purposes).
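A minimal sketch of the groupby-iteration suggestion, assuming the per-group function returns a frame that can be concatenated back together. The function `effectful`, the column names, and the data are hypothetical stand-ins.

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 0, 1, 1], "b": [10, 11, 20, 21]})

calls = []  # records each invocation, standing in for a real side effect


def effectful(key, group):
    # Hypothetical expensive or side-effecting per-group function.
    calls.append(key)
    return group.assign(b_centered=group["b"] - group["b"].mean())


# Iterating over the groupby yields each (key, sub-frame) pair exactly once,
# so the function is called once per group.
pieces = [effectful(key, group) for key, group in df.groupby("a")]
result = pd.concat(pieces).sort_index()
```

The trade-off is that the caller has to reassemble the pieces, which `apply` would otherwise do automatically.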
It's a very slick, high-level API. But it tries to guess what the user wants, and sometimes this results in very unexpected behavior that is quite difficult to work around without the user including snippets of code that are completely unrelated to what the user wants to do. Personally, I am not a fan of APIs that try to be too smart. I know what I want, and I want to tell the computer to do what I want. So, feature request: could you please provide a lower-level API that would allow more direct control over the flow of execution? Just a simple Currently, I'm working up a minimal working example of my other use-case for |
I see that there is an old enhancement request open to add the Alternatively, in some cases it might be possible to capture the results of Maybe at some point I'll even take a crack at implementing these things myself. FWIW, I don't buy the argument that these things are an "implementation detail" -- The way in which my code is called is very important to me, and that brings it quite a bit beyond the level of "implementation detail". |
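Until something like that exists, a user-side workaround is to cache results so that a repeated call on the same group does not redo the work. The sketch below is not a pandas feature: `run_once_per_group` and `expensive` are hypothetical names, and it assumes the frame has a unique index so that a group's row labels identify it.

```python
import pandas as pd


def run_once_per_group(func):
    """Cache func's result per group, keyed by the group's row labels.

    Assumes the original frame has a unique index (an assumption of this sketch).
    """
    cache = {}

    def wrapper(group):
        key = tuple(group.index)
        if key not in cache:
            cache[key] = func(group)
        return cache[key]

    return wrapper


executed = []  # records how often the expensive body really runs


def expensive(group):
    # Hypothetical costly computation with a side effect.
    executed.append(tuple(group.index))
    return group["b"].sum()


df = pd.DataFrame({"a": [0, 0, 1, 1], "b": [1, 2, 3, 4]})
totals = df.groupby("a")[["b"]].apply(run_once_per_group(expensive))
```

If `apply` does call the wrapper twice for the first group, the second call returns the cached value, so the expensive body and its side effects run once per group.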
Well, pull requests are welcome, and it IS an implementation detail. If you abide by the API of the groupby and don't have side effects, then there is no issue. If you want your function called in a particular way, then go ahead and don't use apply.
I think the way to do this is to add a parameter; then the user can change that if wanted. Using the already computed group is done in transform, IIRC.
This seems like a good possibility to me. I would really appreciate it. It might also help to emit a warning when fastpath is attempted and fails. IPython has a nice facility for displaying a warning only the first time it appears. It is quite nice. I don't know exactly how it works, but it seems to be automatic.
This bug is easy to reproduce:
This should print
and return a Series containing two zeroes. Although it does return the expected Series, it prints
So, the function printA is being applied to the first group twice. This doesn't seem to affect the eventual returned result of apply, but it duplicates any side effects, and if printA performed some expensive computation, it would increase the time to completion.
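The snippet from the report is not preserved here, so the following is only a sketch of how one might check the behavior, not the original code. `record_group` and the data are hypothetical. How many calls get recorded depends on the installed pandas version, so no expected output is given.

```python
import pandas as pd

df = pd.DataFrame({"a": [0, 0, 1, 1], "b": [10, 10, 20, 20]})

seen = []  # one entry per invocation of the applied function


def record_group(group):
    # Record which group we were called on ('b' is constant within a group),
    # then return a constant, as the function in the report does.
    seen.append(int(group["b"].iloc[0]))
    return 0


result = df.groupby("a")[["b"]].apply(record_group)

# If the first group's value appears twice in `seen`, apply invoked the
# function on that group twice.
print(seen)
print(result)
```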