GroupBy.apply repeats the first group N times when slicing and sorting in the applied function #25892

Closed
gilbenzvi opened this issue Mar 27, 2019 · 5 comments

Comments

@gilbenzvi
Hi,

When using GroupBy.apply with a function that calls pandas.DataFrame.sort_values() with inplace=True and also slices the data frame, the first group is evaluated N times (where N is the number of groups).

See the following code and output:

>>> import numpy as np
>>> import pandas as pd
>>> 
>>> A = np.repeat('a', 8)
>>> B = np.repeat([1, 2, 3, 4], 2)
>>> data = pd.DataFrame.from_dict({'A': A, 'B': B})
>>> print(data)
   A  B
0  a  1
1  a  1
2  a  2
3  a  2
4  a  3
5  a  3
6  a  4
7  a  4
>>> 
>>> def func(group_data, sort, slice):
...     if sort:
...         group_data.sort_values('A', inplace=True)
...     if slice:
...         group_data = group_data[:]
...     return group_data
... 
>>> B_groups = data.groupby('B')
>>> out_sort_slice = B_groups.apply(lambda x: func(x, sort=True, slice=True))
>>> out_sort_no_slice = B_groups.apply(lambda x: func(x, sort=True, slice=False))
>>> out_no_sort_slice = B_groups.apply(lambda x: func(x, sort=False, slice=True))
>>> out_no_sort_no_slice = B_groups.apply(lambda x: func(x, sort=False, slice=False))
>>> 
>>> print(out_sort_slice)
     A  B
B        
1 0  a  1
  1  a  1
2 0  a  1
  1  a  1
3 0  a  1
  1  a  1
4 0  a  1
  1  a  1
>>> print(out_sort_no_slice)
   A  B
0  a  1
1  a  1
2  a  2
3  a  2
4  a  3
5  a  3
6  a  4
7  a  4
>>> print(out_no_sort_slice)
     A  B
B        
1 0  a  1
  1  a  1
2 2  a  2
  3  a  2
3 4  a  3
  5  a  3
4 6  a  4
  7  a  4
>>> print(out_no_sort_no_slice)
   A  B
0  a  1
1  a  1
2  a  2
3  a  2
4  a  3
5  a  3
6  a  4
7  a  4

As you can see, out_sort_no_slice, out_no_sort_slice and out_no_sort_no_slice produce the expected output, while out_sort_slice contains the first group 4 times. While debugging, you can see the first group being evaluated 4 times inside func.
Note that the data is already sorted by field 'A', and the slicing step keeps all the data in the data frame.

I'm using Anaconda Python 3.7 with Pandas 0.24.2 on Mac.

Thanks!

@jreback
Contributor

jreback commented Mar 27, 2019

Why would you ever try to modify in place inside a groupby? This makes no sense, and there are no guarantees here.

As for the double evaluation, you should try on master, as a change for this was just merged.

@gilbenzvi
Author

Why does it make no sense? In each group there's some data processing, and then I want to sort the group data object itself. Of course I can use inplace=False and get a copy of the group data, but why should I? Shouldn't I assume each group is a different object?

As for the double evaluation issue: yes, I'm aware of it, thanks.

@tzuni
tzuni commented Apr 3, 2019

"Of course I can use inplace=False and get a copy of the group data, but why should I? Shouldn't I assume each group is a different object?"

Modifying the contents of a groupby object is akin to modifying the contents of a collection you are iterating over. It's in general unpredictable and considered bad practice.
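As a plain-Python analogy (not pandas itself, and the variable names are only illustrative), here is a minimal sketch of what can go wrong when a list is changed while it is being iterated over, next to the safer pattern of iterating over a copy:

```python
# Mutating a list while iterating over it: the iterator's position no
# longer lines up with the elements, so one element is silently skipped.
numbers = [1, 2, 3, 4]
seen = []
for n in numbers:
    seen.append(n)
    if n == 1:
        numbers.remove(n)  # shifts the remaining elements one slot to the left

print(seen)  # [1, 3, 4] -- the element 2 was never visited

# Safer: iterate over a copy and mutate the original.
numbers = [1, 2, 3, 4]
seen_safe = []
for n in list(numbers):
    seen_safe.append(n)
    if n == 1:
        numbers.remove(n)

print(seen_safe)  # [1, 2, 3, 4]
```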

In general, apply functions are not intended to modify in place the thing on which they are applied. The established pattern is:

newthing = thing.apply(func)

If you want to sort a column, you assign the sorted version back to itself:

col = col.apply(sort)

Similarly, here you should not sort the group in place but return the sorted group as a new object.
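A minimal sketch of that pattern, assuming a hypothetical extra column 'C' so that the sort actually reorders rows (the data in the report has a constant 'A'). The function name sort_group is also just illustrative:

```python
import numpy as np
import pandas as pd

# Same 'A' and 'B' as in the report, plus a hypothetical 'C' column that is
# out of order within each group.
data = pd.DataFrame({
    "A": np.repeat("a", 8),
    "B": np.repeat([1, 2, 3, 4], 2),
    "C": [2, 1, 4, 3, 6, 5, 8, 7],
})
original = data.copy()  # kept only to show that the input is not modified

def sort_group(group_data):
    # Without inplace=True, sort_values returns a new DataFrame and leaves
    # the object handed in by groupby untouched.
    return group_data.sort_values("C")

# group_keys=False keeps the original row labels instead of adding 'B'
# as an extra index level.
out = data.groupby("B", group_keys=False).apply(sort_group)
print(out)
```

Depending on the pandas version, the frame passed to the function may or may not include the grouping column 'B', and recent releases may emit a warning about that. The point of the sketch is only that the sorted result is returned rather than written back into the group.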

The documentation warns that apply() calls the function twice on the first group as a consequence of a speed optimization, and that you should therefore avoid anything with side effects. The behavior you're seeing here is a result of that.
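One way to check how often the function is actually called on a given installation is to log each call. This is only a sketch: first_labels and record_and_return are hypothetical names, and the number of calls depends on the pandas version, so no particular output is claimed here.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"A": np.repeat("a", 8), "B": np.repeat([1, 2, 3, 4], 2)})

first_labels = []  # hypothetical log: first row label of every group passed in

def record_and_return(group_data):
    # Logging is itself a side effect; it is used here only to observe calls.
    first_labels.append(group_data.index[0])
    return group_data

data.groupby("B").apply(record_and_return)

n_groups = data["B"].nunique()
print(first_labels)
print(len(first_labels), "calls for", n_groups, "groups")
```

If the first group's label shows up more often than the others, the function was evaluated more than once for that group, which is exactly the situation where in-place modifications become unpredictable.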

@WillAyd
Member

WillAyd commented Apr 4, 2019

Right, the first group appearing more than once should have been resolved in #24748, and the comments above give good guidance on modifications made during iteration.

@WillAyd
Member

WillAyd commented Apr 4, 2019

Closing as no action

@WillAyd WillAyd closed this as completed Apr 4, 2019
@WillAyd WillAyd added this to the No action milestone Apr 4, 2019