Closed
Description
I was trying to document my experiences with the inconsistencies of DataFrame.groupby.apply
(see #22545), and one of them was the following:
import numpy as np
import pandas as pd

N = 5
df = pd.DataFrame(index=range(N), columns=['id', 'x', 'y', 'z'])
df.loc[:, ['x', 'y', 'z']] = np.arange(N*3).reshape(N, 3)
df.id = np.random.randint(0, 3, (N,)) + 10
df
#    id   x   y   z
# 0  11   0   1   2
# 1  10   3   4   5
# 2  10   6   7   8
# 3  12   9  10  11
# 4  12  12  13  14
Even though the function passed to apply returns exactly the same frame for each group in both cases, the following two calls produce different outputs:
df.groupby('id', as_index=True).apply(lambda gr: gr)
#    id   x   y   z
# 0  11   0   1   2
# 1  10   3   4   5
# 2  10   6   7   8
# 3  12   9  10  11
# 4  12  12  13  14
df.groupby('id', as_index=True).apply(lambda gr: gr.iloc[:10 ** 6])
#        id   x   y   z
# id
# 10 1   10   3   4   5
#    2   10   6   7   8
# 11 0   11   0   1   2
# 12 3   12   9  10  11
#    4   12  12  13  14
The first call just returns the original frame as-is, with no attempt to group the results the way the second output does. Furthermore, neither output should still contain the id column: once id is also an index level, the name is ambiguous between the index and the columns (e.g. if one continues with another groupby after some further transformations).
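To illustrate the ambiguity, here is a minimal sketch that builds the shape of the second output by hand (without going through groupby.apply) and then tries to group by id again. The fixed ids and the names `both` and `ambiguous` are only for illustration; the sketch assumes a recent pandas version, where this raises a ValueError (older versions may only have warned).

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the frame above (the report uses random ids).
df = pd.DataFrame(np.arange(15).reshape(5, 3), columns=['x', 'y', 'z'])
df.insert(0, 'id', [11, 10, 10, 12, 12])

# Keep 'id' as a column (drop=False) and also make it the outer index level,
# which mirrors the layout of the second output.
both = df.set_index('id', append=True, drop=False).swaplevel().sort_index()

try:
    both.groupby('id').size()
    ambiguous = False
except ValueError as err:
    # Assumption: recent pandas refuses a name that is both a level and a column.
    ambiguous = True
    print(err)
```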
Desired output of both:
#         x   y   z
# id
# 10 1    3   4   5
#    2    6   7   8
# 11 0    0   1   2
# 12 3    9  10  11
#    4   12  13  14
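For comparison, the desired layout can be produced without groupby.apply at all. This is only a sketch of a possible workaround, not a statement about how apply should be implemented; it uses fixed ids matching the frame printed above, and `desired` is a name chosen for illustration.

```python
import numpy as np
import pandas as pd

N = 5
# Deterministic stand-in for the frame above (the report uses random ids).
df = pd.DataFrame(np.arange(N * 3).reshape(N, 3), columns=['x', 'y', 'z'])
df.insert(0, 'id', [11, 10, 10, 12, 12])

# Move 'id' out of the columns into the index, make it the outer level,
# and sort so that rows of the same group end up next to each other.
desired = df.set_index('id', append=True).swaplevel().sort_index()
print(desired)
```

Here id exists only as an index level, so a subsequent groupby('id') or groupby(level='id') refers to one thing only.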