The apply method over a group seems to visit the first group twice #2656

Closed
srippa opened this issue Jan 8, 2013 · 8 comments · Fixed by #24748

@srippa

srippa commented Jan 8, 2013

It seems that the apply method of a group invokes the function passed as a parameter once for each group, as expected. However, the function is invoked twice for the first group. Enclosed is a short code sample reproducing this behavior. Is this normal behavior, or am I doing something wrong here?

I ran the code with pandas 0.10.0 on Windows 7, 32-bit version.

The code reproducing this behavior:

import pandas as pd

tdf = pd.DataFrame( { 'cat' : [1,1,1,2,2,1,1,2], 'B' : range(8)})
category = tdf['cat']
pGroups = tdf.groupby(by=category)

def getLen(df):
    print 'len ', len(df)
    return len(df)

plen = pGroups.apply(getLen)

@wesm
Member

wesm commented Jan 8, 2013

Yes. The issue is that the groupby infrastructure can take a very fast code path if the function returns a new object, as opposed to a modified version of the chunk or a view on the chunk. This is new in 0.10. It may make sense to add an option to groupby that disables this behavior and takes the slower code path.

@michaelaye
Contributor

Hm, Wes, maybe I'm missing the point here, but why did you not mark this as a bug? Your comment even sounds like you consider this merely a feature request. Do you not consider it wrong to provide three results in this case, even though there are only 2 groups? As I said, I'm afraid I'm missing something here on my way to really getting a grip on groupby().

@dalejung
Contributor

It's not providing a duplicate value. The GroupBy internals will attempt a fast_apply; on failure, they fall back to the old pathway. The data returned from the GroupBy should be exactly the same, provided you aren't doing something funky with globals.

@michaelaye
Contributor

data returned from the GroupBy should be exactly the same

Same as what?
This is what I get from above example:

In [1]: import pandas as pd

In [2]: tdf = pd.DataFrame({'cat':[1,1,1,2,2,1,1,2],'B':range(8)})

In [3]: tdf
Out[3]: 
   B  cat
0  0    1
1  1    1
2  2    1
3  3    2
4  4    2
5  5    1
6  6    1
7  7    2

In [4]: category = tdf['cat']

In [5]: pGroups = tdf.groupby(by=category)

In [7]: def getLen(df):
    print 'len',len(df)
    return len(df)
   ...: 

In [8]: plen = pGroups.apply(getLen)
len 5
len 5
len 3

The groupby is by category, which has only 2 distinct values, 1 and 2. So shouldn't apply(getLen) call the function only twice? Apologies if I'm missing something obvious. (Using 0.10.1)

@wesm
Member

wesm commented Feb 11, 2013

The function is called an additional time to determine whether the function has side effects. If it does not have side effects, then a faster code path (substantially faster) can be used. If it does, the old slow code path is used. There should be an option to force the slow path to always be taken in cases like yours.
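
To illustrate the idea, here is a minimal pure-Python sketch of a "probe, then apply" strategy. It is a toy model only, not the actual pandas internals: `toy_group_apply`, the copy-and-compare side-effect check, and the list-of-dicts data layout are all assumptions made up for this example. It shows how one extra probing call on the first group leads to three invocations while still producing exactly one result per group.

```python
def toy_group_apply(rows, key, func):
    """Toy model (hypothetical, not pandas internals) of probe-then-apply."""
    # Split the rows into groups, keeping first-seen key order.
    groups = {}
    for row in rows:
        groups.setdefault(row[key], []).append(row)

    # Probe: call func once on a copy of the first group and check whether
    # it modified its input. The return value of this call is discarded.
    first_key = next(iter(groups))
    probe = [dict(r) for r in groups[first_key]]
    func(probe)
    has_side_effects = probe != groups[first_key]

    # Real pass: one call, and one result, per group.
    results = {k: func(g) for k, g in groups.items()}
    return results, has_side_effects


calls = []

def get_len(group):
    calls.append(len(group))  # record every invocation
    return len(group)


# Same data as in the issue: the 'cat' values with 'B' = 0..7.
rows = [{"cat": c, "B": b} for b, c in enumerate([1, 1, 1, 2, 2, 1, 1, 2])]
results, has_side_effects = toy_group_apply(rows, "cat", get_len)

print(calls)    # [5, 5, 3]
print(results)  # {1: 5, 2: 3}
```

The first group shows up twice in `calls` because of the probe, but `results` still has one entry per group, which matches what the thread observes.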

@michaelaye
Contributor

Oh darn:

In [29]: len(plen)
Out[29]: 2

In [30]: plen
Out[30]: 
cat
1      5
2      3

I was wrongly assuming that the triple execution meant that the user receives 3 objects, and was puzzled why you don't consider that wrong. Now that I see that, even though the function is executed 3 times, it correctly returns only 2 objects, I understand why it's not a bug. Thanks for your patience.
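
For anyone who wants to check this themselves, here is a short sketch in Python 3 syntax, assuming a reasonably recent pandas is installed. The `calls` list and the `get_len` name are illustrative additions, not part of the original report. The number of invocations may differ between pandas versions (this issue is marked as fixed by #24748), so the sketch records it without assuming a particular value. The returned result, in contrast, should have one entry per group either way.

```python
import pandas as pd

tdf = pd.DataFrame({"cat": [1, 1, 1, 2, 2, 1, 1, 2], "B": range(8)})

calls = []  # illustrative: records the length seen on every invocation

def get_len(group):
    calls.append(len(group))
    return len(group)

# Apply to the 'B' column of each group; each call receives one group's values.
plen = tdf.groupby("cat")["B"].apply(get_len)

print(len(calls))      # number of invocations; may depend on the pandas version
print(plen.to_dict())  # one entry per group

# If only the group sizes are needed, size() avoids a user function entirely.
sizes = tdf.groupby("cat").size()
print(sizes.to_dict())
```

Because the result does not depend on how often the function is invoked, code that avoids side effects in the applied function behaves the same on either code path.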

@wesm
Member

wesm commented Apr 7, 2013

Also marked this as Someday, pending #2936.

@jreback
Contributor

jreback commented Sep 21, 2013

Closing as not a bug; #2936 might provide this functionality.


5 participants