[BUG] Fixed behavior of DataFrameGroupBy.apply to respect _group_selection_context #29131
Changes from 16 commits
78de38c
63677e7
1d99d9c
98bc673
fbf3202
947a5bd
8c3efb0
fa21e29
a0a9aa5
7070169
76815f1
6c49a16
8a4c1f8
ccf940d
b7d056d
cfacfc1
c384c09
91d1931
83be029
@@ -195,6 +195,70 @@ New repr for :class:`pandas.core.arrays.IntervalArray`
    pd.arrays.IntervalArray.from_tuples([(0, 1), (2, 3)])


.. _whatsnew_1000.api_breaking.GroupBy.apply:

``GroupBy.apply`` behaves consistently with ``as_index``
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

- Previously, the result of :meth:`GroupBy.apply` sometimes contained the grouper column(s)
  both in the index and in the columns of the resulting :class:`DataFrame`. :meth:`GroupBy.apply`
  now respects the ``as_index`` parameter and only returns the grouper column(s) as columns of
  the result if ``as_index`` is set to ``False``. Other methods, such as :meth:`GroupBy.resample`,
  exhibited similar behavior and now also respect the ``as_index`` parameter.
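
As a point of reference, the ``as_index`` semantics described in this entry can be seen on a plain aggregation. This is only a sketch under stated assumptions: it uses ``sum()`` rather than ``apply`` (whose treatment of the grouper column depends on the pandas version in use), and the names ``indexed`` and ``flat`` are illustrative, not taken from the PR.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2, 2, 3, 3], "b": [1, 2, 3, 4, 5, 6]})

# as_index=True (the default): the grouper "a" becomes the index
# and is not repeated among the result columns.
indexed = df.groupby("a", as_index=True).sum()

# as_index=False: the grouper "a" is returned as an ordinary column
# and the result carries a default integer index.
flat = df.groupby("a", as_index=False).sum()

print(indexed)
print(flat)
```

The entry states that ``apply`` and ``resample`` should follow the same rule: the grouper appears as a column only when ``as_index=False``.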

*Previous Behavior*

.. code-block:: ipython

    In [1]: df = pd.DataFrame({"a": [1, 1, 2, 2, 3, 3], "b": [1, 2, 3, 4, 5, 6]})

Review comment: show df here

    In [2]: df.groupby("a").apply(lambda x: x.sum())
    Out[2]:
       a   b
    a
    1  2   3
    2  4   7
    3  6  11

    In [3]: df.groupby("a").apply(lambda x: x.iloc[0])
    Out[3]:
       a  b
    a
    1  1  1
    2  2  3
    3  3  5

    In [4]: idx = pd.date_range('1/1/2000', periods=4, freq='T')

    In [5]: df = pd.DataFrame(data=4 * [range(2)],
       ...:                   index=idx,
       ...:                   columns=['a', 'b'])

    In [6]: df.iloc[2, 0] = 5

Review comment: show df

    In [7]: df.groupby('a').resample('M').sum()
    Out[7]:
                  a  b
    a
    0 2000-01-31  0  3
    5 2000-01-31  5  1


*Current Behavior*

.. ipython:: python
Review comment: break this up into two or more examples; it's just too hard to follow like this. Meaning: change1, change2

    df = pd.DataFrame({"a": [1, 1, 2, 2, 3, 3], "b": [1, 2, 3, 4, 5, 6]})
    df.groupby("a").apply(lambda x: x.sum())
    df.groupby("a").apply(lambda x: x.iloc[0])
    idx = pd.date_range('1/1/2000', periods=4, freq='T')
    df = pd.DataFrame(data=4 * [range(2)],
                      index=idx,
                      columns=['a', 'b'])
    df.iloc[2, 0] = 5
    df.groupby('a').resample('M').sum()


All :class:`SeriesGroupBy` aggregation methods now respect the ``observed`` keyword
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
The following methods now also correctly output values for unobserved categories when called through ``groupby(..., observed=False)`` (:issue:`17605`)
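
To make the ``observed`` keyword concrete, here is a minimal sketch, assuming a categorical grouper with one category (``"z"``) that never occurs in the data. The column names ``key`` and ``val`` and the variable names are hypothetical, chosen only for illustration.

```python
import pandas as pd

# "z" is a declared category that does not appear in the data.
cat = pd.Categorical(["x", "x", "y"], categories=["x", "y", "z"])
df = pd.DataFrame({"key": cat, "val": [1, 2, 3]})

# observed=False keeps every declared category in the result,
# including the unobserved "z".
with_unobserved = df.groupby("key", observed=False)["val"].sum()

# observed=True restricts the result to categories present in the data.
only_observed = df.groupby("key", observed=True)["val"].sum()

print(with_unobserved)
print(only_observed)
```

Comparing the two indexes shows the difference: the first result has an entry for each of the three categories, while the second has entries only for ``"x"`` and ``"y"``.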
@@ -94,9 +94,16 @@ def f(x):
        return x.drop_duplicates("person_name").iloc[0]

    result = g.apply(f)
    expected = x.iloc[[0, 1]].copy()

Review comment: so if the tests change a lot like this, make a new test

    # GH 28549
    # grouper key should not be present after apply
    # with as_index=True.
    # TODO split this into multiple tests
    dropped = x.drop("person_id", axis=1)

    expected = dropped.iloc[[0, 1]].copy()
    expected.index = Index([1, 2], name="person_id")
    expected["person_name"] = expected["person_name"].astype("object")
    expected["person_name"] = expected["person_name"]
    tm.assert_frame_equal(result, expected)

    # GH 9921

@@ -1247,6 +1254,16 @@ def test_get_nonexistent_category():
    # Accessing a Category that is not in the dataframe
    df = pd.DataFrame({"var": ["a", "a", "b", "b"], "val": range(4)})
    with pytest.raises(KeyError, match="'vau'"):
        df.groupby("var").apply(
            lambda rows: pd.DataFrame({"val": [rows.iloc[-1]["vau"]]})
Review comment: nitpick: if its the
        )


def test_category_as_grouper_keys(as_index):
    # Accessing a key that is not in the dataframe
Review comment: wait, this raises now?
    df = pd.DataFrame({"var": ["a", "a", "b", "b"], "val": range(4)})
    bad_key = "'var'" if as_index else "'vau'"
    with pytest.raises(KeyError, match=bad_key):
        df.groupby("var").apply(
            lambda rows: pd.DataFrame(
                {"var": [rows.iloc[-1]["var"]], "val": [rows.iloc[-1]["vau"]]}
Review comment: need to move to 1.1