The apply method over a group seems to visit the first group twice #2656
Comments
Yes. The issue is that the groupby infrastructure can take a very fast code path if the function returns a new object compared with a modified version of the chunk or a view on the chunk. This is new in 0.10-- may make sense to add an option to |
Hm, Wes, maybe I'm missing a point here, but why did you not mark this as a bug? Your comment even sounds like you consider this merely a feature request. Do you not consider it wrong to provide three results in this case, even though there are only 2 groups? As I said, I'm afraid I'm missing something here, on my trip to really get a grip on groupby() |
It's not providing a duplicate value. The GroupBy internals will attempt a fast_apply; on failure they use the old pathway. The data returned from the GroupBy should be exactly the same, provided you aren't doing something funky with globals. |
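To make the "try the fast path, fall back on failure" idea concrete, here is a minimal sketch in plain Python 3. It is not the pandas implementation: `fast_apply_sketch`, `slow_apply_sketch`, `apply_sketch`, and the rule that the fast path only accepts scalar results are all hypothetical, chosen only to show why the returned data stays the same while anything recorded in a global sees one extra call.

```python
# Toy model of "attempt a fast path, fall back to the old path on failure".
# All names and the failure rule below are assumptions for illustration,
# not pandas internals.

def fast_apply_sketch(groups, func):
    """Hypothetical fast path: gives up unless func returns a plain number."""
    results = {}
    for key, chunk in groups.items():
        out = func(chunk)
        if not isinstance(out, (int, float)):
            raise ValueError("fast path cannot handle this result")
        results[key] = out
    return results


def slow_apply_sketch(groups, func):
    """Hypothetical slow path: calls func once per group, no restrictions."""
    return {key: func(chunk) for key, chunk in groups.items()}


def apply_sketch(groups, func):
    """Try the fast path first; on failure start over with the slow path."""
    try:
        return fast_apply_sketch(groups, func)
    except ValueError:
        return slow_apply_sketch(groups, func)


# The 'B' values from the thread's frame, keyed by 'cat'.
groups = {1: [0, 1, 2, 5, 6], 2: [3, 4, 7]}

seen = []  # global side channel, the "funky with globals" case


def record_len(chunk):
    seen.append(len(chunk))
    return [len(chunk)]  # a list, so the toy fast path rejects it


result = apply_sketch(groups, record_len)
print(result)
print(seen)
```

In this sketch the fast path fails on the first group, so that group is visited again by the slow path. The returned mapping still has one entry per group; only the global `seen` list records the extra call.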
The same as what?
In [1]: import pandas as pd
In [2]: tdf = pd.DataFrame({'cat':[1,1,1,2,2,1,1,2],'B':range(8)})
In [3]: tdf
Out[3]:
   B  cat
0  0    1
1  1    1
2  2    1
3  3    2
4  4    2
5  5    1
6  6    1
7  7    2
In [4]: category = tdf['cat']
In [5]: pGroups = tdf.groupby(by=category)
In [7]: def getLen(df):
   ...:     print 'len',len(df)
   ...:     return len(df)
   ...:
In [8]: plen = pGroups.apply(getLen)
len 5
len 5
len 3
The groupby is by category, which has only 2 different items, 1 and 2. So shouldn't getLen be called only twice by pGroups.apply(getLen)? I'm sorry if I'm being dumb somewhere. (Using 0.10.1) |
The function is called an additional time to determine whether the function has side effects. If it does not have side effects, then a faster code path (substantially faster) can be used. If it does, the old slow code path is used. There should be an option to force the slow path to always be taken in cases like yours. |
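The extra call described above can be illustrated with a small plain-Python 3 sketch. This is an assumption-laden toy, not what pandas does internally: `apply_with_probe` and its mutation check via `copy.deepcopy` are hypothetical. It only shows how probing the first group for side effects yields three calls for two groups while still returning two results.

```python
import copy

# Toy model of "call the function once extra to check for side effects".
# apply_with_probe is a hypothetical helper, not a pandas function.


def apply_with_probe(groups, func):
    keys = list(groups)
    first = groups[keys[0]]

    # Extra call: run func on a private copy of the first group and check
    # whether the copy was modified in place.
    probe_input = copy.deepcopy(first)
    func(probe_input)
    mutated = probe_input != first

    if mutated:
        # Stand-in for the slow path: every call gets its own copy.
        return {k: func(copy.deepcopy(groups[k])) for k in keys}
    # Stand-in for the fast path: chunks are passed through directly.
    return {k: func(groups[k]) for k in keys}


calls = []  # records every invocation, like the print in getLen


def get_len(chunk):
    calls.append(len(chunk))
    return len(chunk)


# The 'B' values from the thread's frame, keyed by 'cat'.
groups_by_cat = {1: [0, 1, 2, 5, 6], 2: [3, 4, 7]}

lengths = apply_with_probe(groups_by_cat, get_len)
print(calls)
print(lengths)
```

The call log mirrors the `len 5`, `len 5`, `len 3` output in the thread, and the result has exactly one entry per group.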
Oh darn:
In [29]: len(plen)
Out[29]: 2
In [30]: plen
Out[30]:
cat
1    5
2    3
I was wrongly assuming that the triple execution meant that the user receives 3 objects, and was puzzled why you don't consider that wrong. Now that I see that, even though it is being executed 3 times, it is correctly returning only 2 objects, I understand why it's not a bug. Thanks for your patience. |
Marked this also Someday pending #2936 |
closing as not a bug, #2936 might provide this functionality |
It seems that the apply method of a group invokes the function passed as a parameter once for each group, as expected. However, the function is invoked twice for the first group. Enclosed is a short piece of code reproducing this behavior. Is this normal behavior, or am I doing something wrong here?
I ran the code with pandas 0.10.0 on Windows 7, 32-bit version.
The code reproducing this behavior:
import pandas as pd
tdf = pd.DataFrame( { 'cat' : [1,1,1,2,2,1,1,2], 'B' : range(8)})
category = tdf['cat']
pGroups = tdf.groupby(by=category)
def getLen(df):
    print 'len ',len(df)
    return len(df)
plen = pGroups.apply(getLen)