GroupBy.apply repeats the first group N times when slicing and sorting in the applied function #25892

gilbenzvi · 2019-03-27T14:49:54Z

Hi,

When GroupBy.apply is used with a function that calls pandas.DataFrame.sort_values() with inplace=True and also slices the data frame, the first group is evaluated N times (N is the original number of groups).

See the following code and output:

>>> import numpy as np
>>> import pandas as pd
>>> 
>>> A = np.repeat('a', 8)
>>> B = np.repeat([1, 2, 3, 4], 2)
>>> data = pd.DataFrame.from_dict({'A': A, 'B': B})
>>> print(data)
   A  B
0  a  1
1  a  1
2  a  2
3  a  2
4  a  3
5  a  3
6  a  4
7  a  4
>>> 
>>> def func(group_data, sort, slice):
...     if sort:
...         group_data.sort_values('A', inplace=True)
...     if slice:
...         group_data = group_data[:]
...     return group_data
... 
>>> B_groups = data.groupby('B')
>>> out_sort_slice = B_groups.apply(lambda x: func(x, sort=True, slice=True))
>>> out_sort_no_slice = B_groups.apply(lambda x: func(x, sort=True, slice=False))
>>> out_no_sort_slice = B_groups.apply(lambda x: func(x, sort=False, slice=True))
>>> out_no_sort_no_slice = B_groups.apply(lambda x: func(x, sort=False, slice=False))
>>> 
>>> print(out_sort_slice)
     A  B
B        
1 0  a  1
  1  a  1
2 0  a  1
  1  a  1
3 0  a  1
  1  a  1
4 0  a  1
  1  a  1
>>> print(out_sort_no_slice)
   A  B
0  a  1
1  a  1
2  a  2
3  a  2
4  a  3
5  a  3
6  a  4
7  a  4
>>> print(out_no_sort_slice)
     A  B
B        
1 0  a  1
  1  a  1
2 2  a  2
  3  a  2
3 4  a  3
  5  a  3
4 6  a  4
  7  a  4
>>> print(out_no_sort_no_slice)
   A  B
0  a  1
1  a  1
2  a  2
3  a  2
4  a  3
5  a  3
6  a  4
7  a  4

As you can see, out_sort_no_slice, out_no_sort_slice and out_no_sort_no_slice give the expected output, while out_sort_slice contains the first group 4 times. While debugging, you can see the first group being evaluated 4 times inside func.
Note that the data is already sorted by column 'A', and the slice keeps all rows of the data frame.

I'm using Anaconda Python 3.7 with Pandas 0.24.2 on Mac.

Thanks!

jreback · 2019-03-27T15:04:39Z

Why would you ever try to modify in place in a groupby? This makes no sense,
and there are no guarantees here.

As for the double evaluations, you should try on master, as this was just merged.

gilbenzvi · 2019-03-28T08:35:43Z

Why does it make no sense? In each group there's some data processing, and then I want to sort the group data object itself. Of course I can use inplace=False and get a copy of the group data, but why should I? Shouldn't I assume each group is a different object?

As for the double evaluation issue: yes, I'm aware of it, thanks.

tzuni · 2019-04-03T19:36:19Z

> Of course I can use inplace=False and get a copy of the group data, but why should I? Shouldn't I assume each group is a different object?

Modifying the contents of a groupby object is akin to modifying the contents of a collection you are iterating over. It's in general unpredictable and considered bad practice.
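
A plain-Python analogy, as a minimal sketch (the list and the variable names are made up for illustration and do not come from the issue): removing an element from a list while looping over it makes the loop skip the element that follows.

```python
# Hypothetical illustration: mutate a list while iterating over it.
items = [1, 2, 3, 4]
visited = []

for item in items:
    visited.append(item)
    if item == 2:
        # Removing shifts the later elements one position to the left,
        # so the loop's internal position now points past 3.
        items.remove(item)

print(visited)  # [1, 2, 4]  (3 was never visited)
print(items)    # [1, 3, 4]
```

Nothing raises an error here; the loop simply produces a result the author probably did not intend, which is the kind of surprise an in-place change inside apply can cause.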

In general, apply functions are not intended to modify in place the thing on which they are applied. The established pattern is:

newthing = thing.apply(func)

If you want to sort a column, you assign the sorted version back to itself:

col = col.apply(sort)

Similarly, here you should not sort the group in place but return the sorted group as a new object.
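
As a minimal sketch of that pattern: sort_values without inplace=True returns a new frame and leaves the group it was given untouched. The column C and its values are hypothetical, added only so the sort is visible (the data in the issue has a constant column A), and sort_group is an illustrative name.

```python
import pandas as pd

data = pd.DataFrame({
    "B": [1, 1, 2, 2, 3, 3, 4, 4],
    "C": [8, 7, 6, 5, 4, 3, 2, 1],  # hypothetical values, chosen so sorting is visible
})


def sort_group(group_data):
    # No inplace=True: sort_values returns a new DataFrame and the
    # object handed in by apply is not modified.
    return group_data.sort_values("C")


# Select the column explicitly so the function only ever sees C.
out = data.groupby("B")[["C"]].apply(sort_group)

print(out["C"].tolist())   # [7, 8, 5, 6, 3, 4, 1, 2]
print(data["C"].tolist())  # [8, 7, 6, 5, 4, 3, 2, 1]  (original unchanged)
```

Because the function has no side effects, the result should not depend on how many times apply happens to call it.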

The documentation warns that apply() calls the function twice on the first group as a consequence of a speed optimization, and that the function should therefore not do anything with side effects. The behavior you're seeing here is a result of that.
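
To check how often the function is actually called for each group on a given pandas version, a read-only function can log its input. This is a minimal sketch; calls and record are hypothetical names. On versions that evaluate the first group more than once, its row labels (0, 1) would be expected to appear more than once in the log. No output is shown because it depends on the installed version.

```python
import pandas as pd

data = pd.DataFrame({"A": ["a"] * 8, "B": [1, 1, 2, 2, 3, 3, 4, 4]})

calls = []  # hypothetical log: row labels received on each call


def record(group_data):
    # Read-only: note which rows this call received, then return a copy
    # so that nothing handed in by apply is modified.
    calls.append(tuple(group_data.index))
    return group_data.copy()


out = data.groupby("B")[["A"]].apply(record)

# One line per call; a repeated line means a group was evaluated again.
for rows in calls:
    print(rows)
```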

WillAyd · 2019-04-04T02:02:21Z

Right, the first group appearing more than once should have been resolved in #24748, and the comments above give good guidance on modifications made during iteration.

WillAyd · 2019-04-04T02:02:27Z

Closing as no action

WillAyd closed this as completed Apr 4, 2019

WillAyd added this to the No action milestone Apr 4, 2019
