[BUG] Fixed behavior of DataFrameGroupBy.apply to respect _group_selection_context #29131

Closed. Wants to merge 19 commits. The changes shown below are from 2 of the 19 commits.

Commits
78de38c
Modified tests and fixed bug in groupby/apply
christopherzimmerman Oct 21, 2019
63677e7
fixed resample docstring
christopherzimmerman Oct 21, 2019
1d99d9c
Added fixture and cleaned up tests
christopherzimmerman Oct 21, 2019
98bc673
Whatsnew and added match to test
christopherzimmerman Oct 23, 2019
fbf3202
conftest docstring and whatsnew example
christopherzimmerman Oct 23, 2019
947a5bd
Changes to tests and whatsnew
christopherzimmerman Oct 28, 2019
8c3efb0
Merge branch 'master' into apply_context
christopherzimmerman Oct 28, 2019
fa21e29
Had to parameterize the test because of group keys
christopherzimmerman Oct 28, 2019
a0a9aa5
Merge branch 'apply_context' of https://github.com/christopherzimmerm…
christopherzimmerman Oct 28, 2019
7070169
Merge branch 'master' into apply_context
christopherzimmerman Nov 1, 2019
76815f1
Update test_transform.py
christopherzimmerman Nov 1, 2019
6c49a16
Update test_groupby.py
christopherzimmerman Nov 1, 2019
8a4c1f8
Updated syntax in ipython block
christopherzimmerman Nov 1, 2019
ccf940d
Merge branch 'apply_context' of https://github.com/christopherzimmerm…
christopherzimmerman Nov 1, 2019
b7d056d
Merged from master
christopherzimmerman Nov 20, 2019
cfacfc1
Indent
christopherzimmerman Nov 20, 2019
c384c09
Merge remote-tracking branch 'upstream/master' into apply_context
christopherzimmerman Jan 2, 2020
91d1931
Merged upstream
christopherzimmerman Feb 3, 2020
83be029
More tests changed with bad apply behavior
christopherzimmerman Feb 4, 2020

66 changes: 33 additions & 33 deletions pandas/core/groupby/groupby.py
@@ -721,7 +721,9 @@ def f(g):
f = func

# ignore SettingWithCopy here in case the user mutates
with option_context("mode.chained_assignment", None):
with option_context(
"mode.chained_assignment", None
) as _, _group_selection_context(self) as _:
try:
result = self._python_apply_general(f)
except TypeError:
@@ -732,9 +734,7 @@ def f(g):
# except if the udf is trying an operation that
# fails on *some* columns, e.g. a numeric operation
# on a string grouper column

with _group_selection_context(self):
return self._python_apply_general(f)
return self._python_apply_general(f)

return result
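
For context, here is a minimal sketch of the user-visible change this hunk is after (GH 28549), modeled on test_apply_chunk_view below. It assumes a pandas build with this patch applied and hypothetical data; it is not the behavior of released pandas.

import pandas as pd

df = pd.DataFrame({"key": [1, 1, 2, 2], "value": [0, 1, 2, 3]})

# With the default as_index=True, apply now runs inside
# _group_selection_context, so the grouping column "key" is dropped from the
# chunks passed to the UDF and from the result, leaving only "value".
result = df.groupby("key", group_keys=False).apply(lambda x: x[:1])

# With as_index=False the grouping column stays in the output, which is why
# several tests in this PR are parametrized over as_index.
result_with_key = df.groupby("key", group_keys=False, as_index=False).apply(lambda x: x[:1])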

@@ -1465,7 +1465,7 @@ def resample(self, rule, *args, **kwargs):
... columns=['a', 'b'])
>>> df.iloc[2, 0] = 5
>>> df
a b
a b
2000-01-01 00:00:00 0 1
2000-01-01 00:01:00 0 1
2000-01-01 00:02:00 5 1
@@ -1475,63 +1475,63 @@ def resample(self, rule, *args, **kwargs):
the timestamps falling into a bin.

>>> df.groupby('a').resample('3T').sum()
a b
b
a
0 2000-01-01 00:00:00 0 2
2000-01-01 00:03:00 0 1
5 2000-01-01 00:00:00 5 1
0 2000-01-01 00:00:00 2
2000-01-01 00:03:00 1
5 2000-01-01 00:00:00 1

Upsample the series into 30 second bins.

>>> df.groupby('a').resample('30S').sum()
a b
b
a
0 2000-01-01 00:00:00 0 1
2000-01-01 00:00:30 0 0
2000-01-01 00:01:00 0 1
2000-01-01 00:01:30 0 0
2000-01-01 00:02:00 0 0
2000-01-01 00:02:30 0 0
2000-01-01 00:03:00 0 1
5 2000-01-01 00:02:00 5 1
0 2000-01-01 00:00:00 1
2000-01-01 00:00:30 0
2000-01-01 00:01:00 1
2000-01-01 00:01:30 0
2000-01-01 00:02:00 0
2000-01-01 00:02:30 0
2000-01-01 00:03:00 1
5 2000-01-01 00:02:00 1

Resample by month. Values are assigned to the month of the period.

>>> df.groupby('a').resample('M').sum()
a b
b
a
0 2000-01-31 0 3
5 2000-01-31 5 1
0 2000-01-31 3
5 2000-01-31 1

Downsample the series into 3 minute bins as above, but close the right
side of the bin interval.

>>> df.groupby('a').resample('3T', closed='right').sum()
a b
b
a
0 1999-12-31 23:57:00 0 1
2000-01-01 00:00:00 0 2
5 2000-01-01 00:00:00 5 1
0 1999-12-31 23:57:00 1
2000-01-01 00:00:00 2
5 2000-01-01 00:00:00 1

Downsample the series into 3 minute bins and close the right side of
the bin interval, but label each bin using the right edge instead of
the left.

>>> df.groupby('a').resample('3T', closed='right', label='right').sum()
a b
b
a
0 2000-01-01 00:00:00 0 1
2000-01-01 00:03:00 0 2
5 2000-01-01 00:03:00 5 1
0 2000-01-01 00:00:00 1
2000-01-01 00:03:00 2
5 2000-01-01 00:03:00 1

Add an offset of twenty seconds.

>>> df.groupby('a').resample('3T', loffset='20s').sum()
a b
b
a
0 2000-01-01 00:00:20 0 2
2000-01-01 00:03:20 0 1
5 2000-01-01 00:00:20 5 1
0 2000-01-01 00:00:20 2
2000-01-01 00:03:20 1
5 2000-01-01 00:00:20 1
"""
from pandas.core.resample import get_resampler_for_grouping

61 changes: 46 additions & 15 deletions pandas/tests/groupby/test_apply.py
@@ -363,11 +363,19 @@ def f(group):
tm.assert_frame_equal(result.loc[key], f(group))


def test_apply_chunk_view():
@pytest.mark.parametrize("as_index", [False, True])

Review comment (Member): Do you even need this in the test? Looks like the test itself just removes the True case.

def test_apply_chunk_view(as_index):
# Low level tinkering could be unsafe, make sure not
df = DataFrame({"key": [1, 1, 1, 2, 2, 2, 3, 3, 3], "value": range(9)})

result = df.groupby("key", group_keys=False).apply(lambda x: x[:2])
result = df.groupby("key", group_keys=False, as_index=as_index).apply(
lambda x: x[:2]
)
# GH 28549
# key no longer included in reduction output
if as_index:
df.pop("key")

expected = df.take([0, 1, 3, 4, 6, 7])
tm.assert_frame_equal(result, expected)

@@ -386,7 +394,8 @@ def test_apply_no_name_column_conflict():
grouped.apply(lambda x: x.sort_values("value", inplace=True))


def test_apply_typecast_fail():
@pytest.mark.parametrize("as_index", [True, False])
def test_apply_typecast_fail(as_index):
df = DataFrame(
{
"d": [1.0, 1.0, 1.0, 2.0, 2.0, 2.0],
@@ -400,7 +409,12 @@ def f(group):
group["v2"] = (v - v.min()) / (v.max() - v.min())
return group

result = df.groupby("d").apply(f)
result = df.groupby("d", as_index=as_index).apply(f)

# GH 28549
# key no longer included in reduction output
if as_index:
df.pop("d")

expected = df.copy()
expected["v2"] = np.tile([0.0, 0.5, 1], 2)
@@ -426,6 +440,10 @@ def f(group):

result = df.groupby("d").apply(f)

# GH 28549
# key no longer included in reduction output
df.pop("d")

expected = df.copy()
expected["v2"] = np.tile([0.0, 0.5, 1], 2)

@@ -638,24 +656,37 @@ def test_func(x):
tm.assert_frame_equal(result, expected)


def test_groupby_apply_none_first():
@pytest.mark.parametrize("as_index", [True, False])
@pytest.mark.parametrize(
"groups, vars_, expected_vars, expected_groups",
[
([1, 1, 1, 2], [0, 1, 2, 3], [0, 2], [1, 1]),
([1, 2, 2, 2], [0, 1, 2, 3], [1, 3], [2, 2]),
],
)
def test_groupby_apply_none_first(
groups, vars_, expected_vars, expected_groups, as_index
):
# GH 12824. Tests if apply returns None first.
test_df1 = DataFrame({"groups": [1, 1, 1, 2], "vars": [0, 1, 2, 3]})
test_df2 = DataFrame({"groups": [1, 2, 2, 2], "vars": [0, 1, 2, 3]})
test_df = DataFrame({"groups": groups, "vars": vars_})

def test_func(x):
if x.shape[0] < 2:
return None
return x.iloc[[0, -1]]

result1 = test_df1.groupby("groups").apply(test_func)
result2 = test_df2.groupby("groups").apply(test_func)
index1 = MultiIndex.from_arrays([[1, 1], [0, 2]], names=["groups", None])
index2 = MultiIndex.from_arrays([[2, 2], [1, 3]], names=["groups", None])
expected1 = DataFrame({"groups": [1, 1], "vars": [0, 2]}, index=index1)
expected2 = DataFrame({"groups": [2, 2], "vars": [1, 3]}, index=index2)
tm.assert_frame_equal(result1, expected1)
tm.assert_frame_equal(result2, expected2)
result = test_df.groupby("groups", as_index=as_index).apply(test_func)

# GH 28549 "groups" should not be in output of apply
# unless as_index=True
if not as_index:
expected = DataFrame(
{"groups": expected_groups, "vars": expected_vars}, index=result.index
)
else:
expected = DataFrame({"vars": expected_vars}, index=result.index)

tm.assert_frame_equal(result, expected)
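
As a side note, a short sketch of what the parametrized test above exercises, using its first data set and assuming a pandas build with this patch applied (the helper name first_last is mine, not the test's):

import pandas as pd

df = pd.DataFrame({"groups": [1, 1, 1, 2], "vars": [0, 1, 2, 3]})

def first_last(x):
    # Groups with fewer than 2 rows return None and are skipped by apply
    if x.shape[0] < 2:
        return None
    return x.iloc[[0, -1]]

result = df.groupby("groups").apply(first_last)
# Only group 1 contributes rows; under the patch (as_index=True) the "groups"
# column is also dropped from the result, leaving "vars" equal to [0, 2].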


def test_groupby_apply_return_empty_chunk():
16 changes: 13 additions & 3 deletions pandas/tests/groupby/test_categorical.py
@@ -95,9 +95,16 @@ def f(x):
return x.drop_duplicates("person_name").iloc[0]

result = g.apply(f)
expected = x.iloc[[0, 1]].copy()

Review comment (Contributor): If the tests change a lot like this, make a new test.

# GH 28549
# grouper key should not be present after apply
# with as_index=True.
# TODO split this into multiple tests
dropped = x.drop("person_id", 1)

expected = dropped.iloc[[0, 1]].copy()
expected.index = Index([1, 2], name="person_id")
expected["person_name"] = expected["person_name"].astype("object")
expected["person_name"] = expected["person_name"]
tm.assert_frame_equal(result, expected)

# GH 9921
@@ -1218,7 +1225,10 @@ def test_get_nonexistent_category():
# Accessing a Category that is not in the dataframe
df = pd.DataFrame({"var": ["a", "a", "b", "b"], "val": range(4)})
with pytest.raises(KeyError, match="'vau'"):
df.groupby("var").apply(
            # GH 28549: this needs to use as_index=False so that
            # var is still present when grouping, or else a different
            # KeyError about var will be raised.
df.groupby("var", as_index=False).apply(
lambda rows: pd.DataFrame(
{"var": [rows.iloc[-1]["var"]], "val": [rows.iloc[-1]["vau"]]}
)
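
A short sketch of the failure mode that comment describes, assuming a pandas build with this patch applied: with as_index=True the grouping column is no longer visible inside the applied function, so the lookup would fail on "var" before it ever reaches "vau".

import pandas as pd

df = pd.DataFrame({"var": ["a", "a", "b", "b"], "val": range(4)})

def f(rows):
    # "var" is only present inside the UDF when as_index=False under the patch
    return pd.DataFrame({"var": [rows.iloc[-1]["var"]], "val": [rows.iloc[-1]["vau"]]})

# df.groupby("var", as_index=False).apply(f)  # raises KeyError: 'vau', the error the test expects
# df.groupby("var").apply(f)                  # under the patch, raises a KeyError about 'var' instead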
38 changes: 26 additions & 12 deletions pandas/tests/groupby/test_function.py
@@ -87,10 +87,6 @@ def test_intercept_builtin_sum():
tm.assert_series_equal(result2, expected)


# @pytest.mark.parametrize("f", [max, min, sum])
# def test_builtins_apply(f):

Review comment (Member): This is the kind of thing that can be done in a separate PR.

@pytest.mark.parametrize("f", [max, min, sum])
@pytest.mark.parametrize("keys", ["jim", ["jim", "joe"]]) # Single key # Multi-key
def test_builtins_apply(keys, f):
@@ -102,21 +98,33 @@ def test_builtins_apply(keys, f):
result = df.groupby(keys).apply(f)
ngroups = len(df.drop_duplicates(subset=keys))

assert_msg = "invalid frame shape: {} (expected ({}, 3))".format(
result.shape, ngroups
# GH 28549
# grouping keys should not be included in output
if isinstance(keys, list):
result_shape = len(df.columns) - len(keys)
else:
result_shape = len(df.columns) - 1

assert_msg = "invalid frame shape: {} (expected ({}, {}))".format(
result.shape, ngroups, result_shape
)
assert result.shape == (ngroups, 3), assert_msg
assert result.shape == (ngroups, result_shape), assert_msg

tm.assert_frame_equal(
result, # numpy's equivalent function
df.groupby(keys).apply(getattr(np, fname)),
)

if f != sum:
expected = df.groupby(keys).agg(fname).reset_index()
expected.set_index(keys, inplace=True, drop=False)
# GH 28549
# No longer need to reset/set index here
expected = df.groupby(keys).agg(fname)

Review comment (Contributor): Can you try not to use groupby here and just construct the expected result? The comment is also not very useful, since a future reader won't have context. Can you break this out into a new test?

tm.assert_frame_equal(result, expected, check_dtype=False)

# GH 28549
# grouping keys should not be in output
df = df.drop(keys, 1)

tm.assert_series_equal(getattr(result, fname)(), getattr(df, fname)())
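
To illustrate the shape rule asserted above with concrete numbers, here is a hypothetical frame (not the test's fixture), again assuming a pandas build with this patch applied: the apply output has ngroups rows and (number of columns minus number of grouping keys) columns.

import numpy as np
import pandas as pd

df = pd.DataFrame({"jim": [1, 1, 2], "joe": [3, 4, 4], "val": [0.5, 1.5, 2.5]})

single = df.groupby("jim").apply(np.sum)          # expected shape (2, 2): columns "joe", "val"
multi = df.groupby(["jim", "joe"]).apply(np.sum)  # expected shape (3, 1): column "val"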


@@ -341,10 +349,13 @@ def test_cython_api2():
tm.assert_frame_equal(result, expected)

# GH 13994
result = df.groupby("A").cumsum(axis=1)
# GH 28549
    # Good representation of how as_index=False now behaves
# as expected

Review comment (Contributor): Split this into separate tests and use the as_index fixture.

Reply (christopherzimmerman, Contributor Author, Oct 28, 2019):

I was unsure what to do about this test. There is a comment in the original test that says "cumsum is a transformer and should ignore as_index", but with the updated behavior, it respects as_index along the first axis. I'm not sure what the actual behavior should be in this case.

Previous Behavior

In [21]: df = DataFrame([[1, 2, np.nan], [1, np.nan, 9], [3, 4, 9]], columns=["A", "B", "C"])

In [22]: df.groupby("A").cumsum(axis=1)
Out[22]:
     A    B     C
0  1.0  3.0   NaN
1  1.0  NaN  10.0
2  3.0  7.0  16.0

Behavior after change to apply

In [4]: df = DataFrame([[1, 2, np.nan], [1, np.nan, 9], [3, 4, 9]], columns=["A", "B", "C"])

In [5]: df.groupby("A").cumsum(axis=1)
Out[5]:
     B     C
0  2.0   NaN
1  NaN   9.0
2  4.0  13.0

result = df.groupby("A", as_index=False).cumsum(axis=1)

Review comment (Contributor): Can you test as_index=True as well? This seems like a big API change here.

expected = df.cumsum(axis=1)
tm.assert_frame_equal(result, expected)
result = df.groupby("A").cumprod(axis=1)
result = df.groupby("A", as_index=False).cumprod(axis=1)
expected = df.cumprod(axis=1)
tm.assert_frame_equal(result, expected)

@@ -1107,7 +1118,10 @@ def test_count():

for key in ["1st", "2nd", ["1st", "2nd"]]:
left = df.groupby(key).count()
right = df.groupby(key).apply(DataFrame.count).drop(key, axis=1)

# GH 28549
# don't need to drop key here anymore
right = df.groupby(key).apply(DataFrame.count)
tm.assert_frame_equal(left, right)

