Inconsistent output in GroupBy.apply returning a DataFrame #34809
Comments
I don't know that I have a strong enough preference to change anything with where things are now. Inconsistency is annoying, but I don't know that the old behavior is necessarily correct enough to revert back to, and that itself is inconsistent when you consider that the method being applied here is shape-preserving (e.g. a typical groupby.transform wouldn't prepend the groupings) |
The difference becomes a bit more apparent with a slightly larger dataframe (repeating the "a" key):
def f(x):
    return x.copy()  # same index
def g(x):
    return x.copy().rename(lambda x: x + 1)  # different index
df = pd.DataFrame({"A": ['a', 'b', 'a'], "B": [1, 2, 3]})
Personally I agree with the rhetorical question "Do we want that kind of value-dependent behavior?". Such drastically different output (the row order differs too, not just the index) based on a subtle difference in the function / output isn't that user friendly IMO. Now, as @WillAyd already said,
Which makes it potentially a bit difficult to point users towards that, unless we add a keyword or something to ensure the grouping column is optionally included in the output. Another related thought: for |
In my specific case |
Rereading #34809 (comment), and I think that the suggestion to include a I'll try to put something together in the next couple days. |
Wanted to leave a general note about what cudf does. cudf has a limited groupby-apply support and generally it's not super performant for GPUs. Additionally cudf generally has tried to match pandas grouping and sorting behavior where possible but here we probably will find some challenges: Also wanted to note that
In [154]: def f(x):
...: return x.copy() # same index
...:
...: def g(x):
...: import cudf
...: idx = range(50, 100)
...: return x.copy().set_index(x.index + 1) # different index
...:
...: cdf = cudf.DataFrame({"A": ['a', 'b', 'a'], "B": [1, 2, 3]})
In [155]: cdf.groupby("A").apply(f)
Out[155]:
A B
0 a 1
1 b 2
2 a 3
In [156]: cdf.groupby("A").apply(g)
Out[156]:
A B
1 a 1
2 b 2
3 a 3 |
Thanks. It looks like this is essentially inconsistent handling of the existing |
Thanks @TomAugspurger and thank you for thinking about cudf handling generally. cuDF did just merge in Pandas 1.0 support: rapidsai/cudf#4546. I assume #34998 will be in 1.1 ? |
Yep.
|
@quasiben thanks for the input! One question about your example: so even with a non-equal index, you preserve the original order of the input dataframe. But when do you do this? If the shape of input/output of the applied function is the same? What happens if the shape is not the same, eg when the applied function is like |
Interestingly the order is still maintained:
In [8]: cdf = cudf.DataFrame({"A": ['a', 'b', 'a', 'a', 'a', 'b'], "B": [1, 2, 3, 6, 7, 8]})
In [9]: def h(x): return x.head(2)
In [10]: cdf.groupby("A").apply(h)
Out[10]:
A B
0 a 1
1 b 2
2 a 3
5 b 8
cc @kkraus14 should he have more follow up here |
cc @shwina as he wrote this code for cuDF |
Rather than maintaining any ordering, I think what we do in cuDF is sort the result by its index before returning in
In [13]: a
Out[13]:
a b
b 1 1
d 2 2
a 1 3
c 2 4
a 1 5
In [14]: a.groupby('a').apply(lambda x: x.head(2))
Out[14]:
a b
a 1 3
b 1 1
c 2 4
d 2 2 |
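For comparison, the sort-by-index behavior described above can be approximated in pandas. This is only a sketch under assumptions: the frame is rebuilt by hand from the `In [13]` output, `group_keys=False` and the explicit `sort_index()` call are choices made here to mimic the description, and column `b` is selected so the grouping column is not passed to the function (so, unlike the cuDF output, the result has no `a` column).

```python
import pandas as pd

# Frame reconstructed from the transcript above (index labels repeat "a").
a = pd.DataFrame(
    {"a": [1, 2, 1, 2, 1], "b": [1, 2, 3, 4, 5]},
    index=["b", "d", "a", "c", "a"],
)

# Take the first two rows of each group, without prepending group keys,
# then sort the concatenated result by its index.
result = (
    a.groupby("a", group_keys=False)[["b"]]
    .apply(lambda x: x.head(2))
    .sort_index()
)
print(result)
```

Without the final `sort_index()`, pandas concatenates the per-group pieces in group order, so the rows would not come back sorted by label.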
I think the second half of #34998 (comment) covers that. Starting with "While working on this, a second values-dependent behavior was discovered". It's unclear what we'll do about that. |
Hello. First of all, thank you for working on pandas; it's a great project and I use it every day. I ran into another indirect problem caused by the inconsistent return type of GroupBy.apply: bar plotting.
df.groupby(["cluster"]).apply(lambda x: x[criteria].value_counts(normalize=True)).plot.bar()
This is (I think?) the most instinctive way to do grouped bar plots.
For me the real workaround consisted of adding an "unstack":
df.groupby(["cluster"]).apply(lambda x: x[criteria].value_counts(normalize=True)).unstack().plot.bar()
The thing is, when the returned type IS a DataFrame, unstack will break everything. And when the returned type IS a pandas.Series, not using unstack will break everything. So the only viable option for users (I think?) is to test the return type of groupby.apply directly and do the conversion ourselves every time we use it:
applied = df.groupby(["cluster"]).apply(lambda x: x[criteria].value_counts(normalize=True))
if isinstance(applied, pd.Series):
    applied.unstack().plot.bar()
else:
    applied.plot.bar()
This is neither pythonic nor practical. I am not sure why the returned type of GroupBy.apply is inconsistent, but I do believe that it should be consistent. In my experience, my "criteria" variable is a boolean, and I get a pd.Series in cases where only True or only False values are counted in at least one of the clusters (see my line of code above). [EDIT: Sorry, my suggestion for the fix was absurd; I just realized it, so I deleted it] |
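One possible way to sidestep the type check, sketched here under assumptions rather than offered as the fix for this issue: call `value_counts` on the selected column of the groupby instead of going through `apply`, which yields a Series with a two-level index, and then unstack it with a fill value. The helper name `normalized_counts`, the column name `flag`, and the sample data are hypothetical.

```python
import pandas as pd

# Hypothetical data: cluster 1 contains only True values, the case
# described above where apply(...) changes its return type.
df = pd.DataFrame({
    "cluster": [0, 0, 1, 1],
    "flag": [True, False, True, True],
})

def normalized_counts(frame, key, column):
    # Per-group relative frequencies as a Series indexed by (key, value).
    counts = frame.groupby(key)[column].value_counts(normalize=True)
    # One row per group, one column per observed value; combinations
    # that never occur are filled with 0 instead of NaN.
    return counts.unstack(fill_value=0)

table = normalized_counts(df, "cluster", "flag")
print(table)
```

The resulting frame has one row per cluster regardless of which values occur in each cluster, so `table.plot.bar()` can be called without branching on the return type (plotting additionally requires matplotlib).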
removing the milestone and blocker label |
This is a continuation of #13056, #14927, and #13056, which were closed by #31613. I think that PR ensured that we consistently take one of two code paths. This issue is to verify that we actually want the behavior on master.
Focusing on a specific pair of examples that differ only in whether the returned index is the same or not:
So the 1.0.4 behavior is to always prepend the group keys to the result as an index level.
In pandas 1.1.0, whether the group keys are prepended depends on whether the udf returns a dataframe with an identical index. Do we want that kind of value-dependent behavior?
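A minimal sketch of the comparison, reusing `f`, `g`, and the three-row frame from the comments in this thread. The helper `apply_with` is hypothetical, and selecting column `B` before `apply` is a choice made here so the grouping column is never handed to the function. The number of index levels printed for `f` with `group_keys=True` depends on the installed pandas version, which is exactly the value-dependent behavior in question, so no expected output is shown.

```python
import pandas as pd

def f(x):
    return x.copy()  # same index as the input group

def g(x):
    return x.copy().rename(index=lambda i: i + 1)  # shifted index

df = pd.DataFrame({"A": ["a", "b", "a"], "B": [1, 2, 3]})

def apply_with(func, group_keys):
    # Hypothetical helper: run func per group with an explicit group_keys.
    return df.groupby("A", group_keys=group_keys)[["B"]].apply(func)

for func in (f, g):
    for group_keys in (True, False):
        result = apply_with(func, group_keys)
        # Two index levels means the group keys were prepended.
        print(func.__name__, group_keys, result.index.nlevels)
```

Passing `group_keys` explicitly makes the intent visible at the call site instead of relying on whether the returned index happens to match the input.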
@jorisvandenbossche's notebook from #13056 (comment) might be helpful, though it might be out of date.