Skip to content

Inconsistent output in GroupBy.apply returning a DataFrame #34809

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
TomAugspurger opened this issue Jun 15, 2020 · 16 comments · Fixed by #34998
Closed

Inconsistent output in GroupBy.apply returning a DataFrame #34809

TomAugspurger opened this issue Jun 15, 2020 · 16 comments · Fixed by #34998
Labels
Apply Apply, Aggregate, Transform, Map Bug Groupby Needs Discussion Requires discussion from core team before further action
Milestone

Comments

@TomAugspurger
Copy link
Contributor

This is a continuation of #13056, #14927, and #13056, which were closed by #31613. I think that PR ensured that we consistently take one of two code paths. This issue is to verify that we actually want the behavior on master.

Focusing on a specific pair of examples that differ only in whether the returned index is the same or not:

# master

In [10]: def f(x):
    ...:     return x.copy()  # same index


In [11]: def g(x):
    ...:     return x.copy().rename(lambda x: x + 1)  # different index

In [12]: df = pd.DataFrame({"A": ['a', 'b'], "B": [1, 2]})

In [13]: df.groupby("A").apply(f)
Out[13]:
   A  B
0  a  1
1  b  2

In [14]: df.groupby("A").apply(g)
Out[14]:
     A  B
A
a 1  a  1
b 2  b  2
# 1.0.4
In [8]: df.groupby("A").apply(f)
Out[8]:
     A  B
A
a 0  a  1
b 1  b  2

In [9]: df.groupby("A").apply(g)
Out[9]:
     A  B
A
a 1  a  1
b 2  b  2

So the 1.0.4 behavior is to always prepend the group keys to the result as an index level.
In pandas 1.1.0, whether the group keys are prepended depends on whether the udf returns a dataframe with an identical index. Do we want that kind of value-dependent behavior?

@jorisvandenbossche's notebook from #13056 (comment) might be helpful, though it might be out of date.

@TomAugspurger TomAugspurger added Bug Needs Triage Issue that has not been reviewed by a pandas team member Blocker Blocking issue or pull request for an upcoming release Needs Discussion Requires discussion from core team before further action Groupby and removed Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 15, 2020
@TomAugspurger TomAugspurger added this to the 1.1 milestone Jun 15, 2020
@WillAyd
Copy link
Member

WillAyd commented Jun 16, 2020

I don't know that I have a strong enough preference to change anything with where things are now. Inconsistency is annoying, but I don't know that the old behavior is necessarily correct enough to revert back to, and that itself is inconsistent when you consider that the method being applied here is shape-preserving (e.g. a typical groupby.transform wouldn't prepend the groupings)

@jorisvandenbossche
Copy link
Member

The difference becomes a bit more apparent with a slightly larger dataframe (repeating the "a" key):

def f(x):
    return x.copy()  # same index

def g(x):
    return x.copy().rename(lambda x: x + 1)  # different index

df = pd.DataFrame({"A": ['a', 'b', 'a'], "B": [1, 2, 3]})
In [6]: df.groupby("A").apply(f) 
Out[6]: 
   A  B
0  a  1
1  b  2
2  a  3

In [7]: df.groupby("A").apply(g) 
Out[7]: 
     A  B
A        
a 1  a  1
  3  a  3
b 2  b  2

Personally I agree with the retoric question "Do we want that kind of value-dependent behavior?". Such a drastic different output (also the row order differs, not just the index) based on a subtle difference in the function / output isn't that user friendly IMO.
Quoting myself from #13056 (comment): "I would personally deprecate this, and point users to groupby().transform(..)"

Now, as @WillAyd already said, tranform doesn't prepend the groupings as the index. But, it also doesn't include the grouping column at all (also not as a normal column, like apply does in some cases):

In [8]: df.groupby("A").transform(f) 
Out[8]: 
   B
0  1
1  2
2  3

Which makes it potentially a bit difficult to point users towards that, unless we add a keyword or something to ensure the grouping column is optionally included in the output.

Another related thought: for DataFrame.apply we have a result_type keyword, to specify what kind of function you are passing / how the result should be interpreted (as a reduction, vs preserve original shape, etc). Something similar might also be useful for the groupby version.

@TomAugspurger
Copy link
Contributor Author

In my specific case .transform isn't an option because the function being applied needs to be applied to the table (it's doing things with multiple columns).

@TomAugspurger
Copy link
Contributor Author

Rereading #34809 (comment), and I think that the suggestion to include a result_type-like keyword to control whether the function should be viewed as an aggregation / transform / other makes sense. I think that's the only possible solution for my use case: a distributed groupby.nunique() that's implemented through a .groupby(grouper).apply(lambda x: x.drop_duplicates()) on each chunk. We want consistent shapes for the result of each chunk. On master, we get value-dependent behavior depending on whether there are any duplicate values or not.

I'll try to put something together in the next couple days.

@quasiben
Copy link

Wanted to leave a general note about what cudf does. cudf has a limited groupby-apply support and generally it's not super performant for GPUs. Additionally cudf generally has tried to match pandas grouping and sorting behavior where possible but here we probably will find some challenges:

Also wanted to note that rename does not support index changes

In [154]: def f(x):
     ...:     return x.copy()  # same index
     ...:
     ...: def g(x):
     ...:     import cudf
     ...:     idx = range(50, 100)
     ...:     return x.copy().set_index(x.index + 1)  # different index
     ...:
     ...: cdf = cudf.DataFrame({"A": ['a', 'b', 'a'], "B": [1, 2, 3]})

In [155]: cdf.groupby("A").apply(f)
Out[155]:
   A  B
0  a  1
1  b  2
2  a  3

In [156]: cdf.groupby("A").apply(g)
Out[156]:
   A  B
1  a  1
2  b  2
3  a  3

@TomAugspurger
Copy link
Contributor Author

Thanks. It looks like this is essentially inconsistent handling of the existing group_keys argument, which #34998 is trying to clean up. Hopefully this doesn't make things too much more complicated for cudf.

@quasiben
Copy link

Thanks @TomAugspurger and thank you for thinking about cudf handling generally. cuDF did just merge in Pandas 1.0 support: rapidsai/cudf#4546. I assume #34998 will be in 1.1 ?

@TomAugspurger
Copy link
Contributor Author

TomAugspurger commented Jun 26, 2020 via email

@jorisvandenbossche
Copy link
Member

@quasiben thanks for the input! One question about your example: so even with a non-equal index, you preserve the original order of the input dataframe. But when do you do this? If the shape of input/output of the applied function is the same? What happens if the shape is not the same, eg when the applied function is like def h(x): return x.head(2) (assuming there are groups larger than 2 rows) ?

@jorisvandenbossche
Copy link
Member

cc @fjetter (not sure if you were actually notified of follow-up discussions after your PR, see also #34998)

@quasiben
Copy link

Interestingly the order is still maintained:

In [8]: cdf = cudf.DataFrame({"A": ['a', 'b', 'a', 'a', 'a', 'b'], "B": [1, 2, 3, 6, 7, 8]})

In [9]: def h(x): return x.head(2)

In [10]: cdf.groupby("A").apply(h)
Out[10]:
   A  B
0  a  1
1  b  2
2  a  3
5  b  8

cc @kkraus14 should he have more follow up here

@kkraus14
Copy link
Contributor

cc @shwina as he wrote this code for cuDF

@shwina
Copy link
Contributor

shwina commented Jul 15, 2020

Rather than maintaining any ordering, I think what we do in cuDF is sort the result by its index before returning in apply:

In [13]: a
Out[13]:
   a  b
b  1  1
d  2  2
a  1  3
c  2  4
a  1  5

In [14]: a.groupby('a').apply(lambda x: x.head(2))
Out[14]:
   a  b
a  1  3
b  1  1
c  2  4
d  2  2

@TomAugspurger
Copy link
Contributor Author

I think the second half of #34998 (comment) covers that. Starting with "While working on this, a second values-dependent behavior was discovered".

It's unclear what we'll do about that.

@GDelevoye
Copy link

GDelevoye commented Sep 24, 2020

Hello

First of all thank you for working on pandas it's a great project, I'm using it everyday.
I am not sure wether this is the best place to mention it or not, but I think it is

I realized another indirect problem raised by the GroupBy.apply inconsistent return type: The bar plotting

df.groupby(["cluster"]).apply(lambda x: x[criteria].value_counts(normalize=True)).plot.bar()

This is (I think ?) the most instinctive way to do grouped barplots.

  • Sometimes, the plot produced will look like a groupped bar plot as expected
  • Sometimes it really won't, because the return type of the whole groupby.apply object is not a dataframe but a pd.Series

For me the real the real workaround consisted in using an additionnal "unstack":

df.groupby(["cluster"]).apply(lambda x: x[criteria].value_counts(normalize=True)).unstack().plot.bar()

The thing is, when the returned type IS a dataframe, unstack will break everything. And when the returned type IS a pandas.Series, not using the unstack will break everything. So the only viable option for the users (I think ?) is to directly test the returntype of the groupy.apply and make the conversions ourselves everytime we use it

applied = df.groupby(["cluster"]).apply(lambda x: x[criteria].value_counts(normalize=True))
if isinstance(applied,pd.Series):
    applied.unstack().plot.bar()
else:
    return applied.plot.bar()

This is neither pythonic nor practical

I am not sure why why the returned type of GroupBy.apply is inconsistent, but I do believe that it should be consistent. In my experience, my "criteria" variable is a boolean, and I realized that I get pd.Series for cases where only True or only False cases would be counted in at least one of the clusters (see my line of code above)

[EDIT: Sorry, my suggestion for the fixing was absurd, I just realized it so I deleted it]
[EDIT] Maybe just raising a more explicit exception would already help a lot

@simonjayhawkins
Copy link
Member

removing the milestone and blocker label

@simonjayhawkins simonjayhawkins removed the Blocker Blocking issue or pull request for an upcoming release label Jun 11, 2021
@simonjayhawkins simonjayhawkins removed this from the 1.3 milestone Jun 11, 2021
@mroeschke mroeschke added Apply Apply, Aggregate, Transform, Map Bug labels Aug 8, 2021
@jreback jreback added this to the 1.5 milestone Mar 30, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Apply Apply, Aggregate, Transform, Map Bug Groupby Needs Discussion Requires discussion from core team before further action
Projects
None yet
Development

Successfully merging a pull request may close this issue.

10 participants