-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: Only apply first group once in fast GroupBy.apply #24748
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: Only apply first group once in fast GroupBy.apply #24748
Conversation
Codecov Report
@@ Coverage Diff @@
## master #24748 +/- ##
=======================================
Coverage 92.38% 92.38%
=======================================
Files 166 166
Lines 52358 52358
=======================================
Hits 48373 48373
Misses 3985 3985
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #24748 +/- ##
==========================================
+ Coverage 91.48% 91.48% +<.01%
==========================================
Files 175 175
Lines 52881 52885 +4
==========================================
+ Hits 48376 48380 +4
Misses 4505 4505
Continue to review full report at Codecov.
|
0102842
to
7354f02
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure you actually need to change, this rather you can modify code in pandas/core/groupby/generic.py and pandas/core/groupby/ops.py, search for fast_path
.
but much more important is to actually have some tests for this.
I don't believe that changing the python code is actually sufficient since the Cython code I am modifying evaluates the first row twice, even if everything works as expected. My change here is essentially only using the calculation for the first row/test if the test is successful instead of recalculating it again. I believe you are referring to pandas/pandas/core/groupby/ops.py Line 174 in 453fa85
where fast_apply (defined here) is called and if successful, the result (calculated in Cython) is immediately returned. If an InvalidApply exception is caught, a slow, python-only path is taken. I'm not trying to optimize the slow/python path but I am arguing that even the fast_apply path will evaluate the first row twice although this isn't necessary.
Apart from implicit tests (this code section should be called at many places) there is one explicit test to ensure that the fast path can be taken and the flags for pandas/pandas/tests/groupby/test_apply.py Line 81 in 453fa85
If there are other edge cases you can think of I can add more tests. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fjetter this is not going to be acceptable w/o a test that fails before and passes now, meaning you have to show an explicit test that does this.
@fjetter I would like to collect all tests from any of the references issues. After you add them see if this closes them; list the issues in the top of the PR. |
@jreback Sorry for the delay. I went through the referenced issues and looked at the code examples the people where posting. Mostly the examples were pretty close to each other and I classified them into two categories (see below). Essentially most reported tickets were suggesting a very simple I added a few more tests to make the current behavior more explicit, i.e. fast path applies row/group once but slow path still does it twice. Type A1 : Groupy DataFrame on col name; apply doesn't mutate (fast path) Type B: Apply DataFrame; apply identity #2936 - Type A1 print / return new df |
9afa234
to
1c7eb85
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pls add a comment with the issue number on the tests (if the test applies to multiple issue, then list both of them).
Add a whatsnew entry in 0.25.0. I would make this a sub-section in api-breaking changes, showing a mini-example and listing all of the closed issues.
pandas/tests/frame/test_apply.py
Outdated
(1, [0, 1, 2, 3, 4, 5]), | ||
]) | ||
def test_apply_first_row_once(self, axis, expected): | ||
df = pd.DataFrame({'a': [0, 0, 1, 1, 2, 2], 'b': np.arange(6)}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add the issue number as a comment
pandas/tests/frame/test_apply.py
Outdated
rows.append(row.name) | ||
return 0 | ||
df.apply(f_fast, axis=axis) | ||
# gh-2936 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahh you have here, well move to the top
also we have a warning in the groupby.rst about this, can remove it. |
1c7eb85
to
b9507a2
Compare
b9507a2
to
fcf8455
Compare
I added a whatsnew entry and added the issue references to the tests. I kept the warning in |
that needs to be addressed here as well - i suspect this is actually the more common problem |
fcf8455
to
ab2c65d
Compare
I changed the implementation such that also the slow path evaluates the function only once. I refactored |
pandas/tests/groupby/test_groupby.py
Outdated
@@ -1420,20 +1420,30 @@ def foo(x): | |||
|
|||
def test_group_name_available_in_inference_pass(): | |||
# gh-15062 | |||
# GH24748 ,GH2936, GH2656, GH7739, GH10519, GH12155, GH20084, GH21417 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
don’t add all the issues, just the ones that apply to each test; then it should be easy to see that we have all of the issues covered
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll remove the links from this test since this test has been there before and is supposed to test whether the group name is available. The connection to this issue is merely coincidental.
For the (very similar) test in groupby/test_apply.py
I added all these ticket numbers since you asked me to go through the issues and collect the tests. They are all duplicates and provide only very little additional information. All of them are linked to #2936 Is it ok if I remove all duplicates and only keep that one? same question for the whatsnew: list all duplicates or just the one reference issue?
doc/source/whatsnew/v0.25.0.rst
Outdated
|
||
.. code-block:: ipython | ||
|
||
In [2]: df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
define df in a separate block above and show it
i am ok with slightly duplicative tests if they are testing subtle behavior - if they are exact duplicates then ok to combine |
The |
doc/source/whatsnew/v0.25.0.rst
Outdated
assert side_effects == ["x", "y"] | ||
|
||
|
||
(:issue:`2936`, :issue:`2656`, :issue:`7739`, :issue:`10519`, :issue:`12155`, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
put the issue numbers at the end of the top paragraph.
doc/source/whatsnew/v0.25.0.rst
Outdated
GroupBy.apply on ``DataFrame`` evaluates first group only once | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
The implementation of ``DataFrame.groupby.apply`` previously evaluated func |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see if you can make this a proper reference, I think:
:meth:`DataFrameGroupBy.apply() <pandas.core.groupby.DataFrameGroupBy.apply>`
should work
doc/source/whatsnew/v0.25.0.rst
Outdated
|
||
.. code-block:: ipython | ||
|
||
In [6]: assert side_effects == ["x", "x", "y"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just print side_effects (and it will show the 3 values)
doc/source/whatsnew/v0.25.0.rst
Outdated
|
||
.. ipython:: python | ||
|
||
assert side_effects == ["x", "y"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here, just show side_effects
pandas/_libs/reduction.pyx
Outdated
piece = piece.copy(deep='all') | ||
else: | ||
mutated = True | ||
except AttributeError: | ||
pass | ||
|
||
results.append(piece) | ||
if status > 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this notion of status is pretty opaque, can you add some comments and/or make this much more obvious what you are doing.
55a8509
to
25fa482
Compare
doc/source/whatsnew/v0.25.0.rst
Outdated
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
The implementation of :meth:`DataFrameGroupBy.apply() <pandas.core.groupby.DataFrameGroupBy.apply>` | ||
previously evaluated func consistently twice on the first group to infer if it |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
func -> the supplied function
doc/source/whatsnew/v0.25.0.rst
Outdated
is safe to use a fast code path. Particularly for functions with side effects, | ||
this was an undesired behavior and may have led to surprises. | ||
|
||
Now every group is evaluated only a single time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
put the issues list in parens.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is affter the sentence above.
doc/source/whatsnew/v0.25.0.rst
Outdated
return group | ||
df.groupby("a").apply(func) | ||
|
||
Previous behavior: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use *Previous Behaviour*
(not the *
) and similar on New
pandas/_libs/reduction.pyx
Outdated
piece = piece.copy(deep='all') | ||
else: | ||
mutated = True | ||
except AttributeError: | ||
pass | ||
|
||
results.append(piece) | ||
if not successfull_fast_apply: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a comment here on why you are returning early
rebased & whatsnew updated. |
@WillAyd if you have any more comments. |
CI failure looked unrelated. Merged master will merge on green |
Thanks @fjetter ! |
Latest version of pandas changed the behaviour of groupby and does not call it multiple times. I believe it's related to pandas-dev/pandas#24748.
Is this true only for groupby.apply? I tried test with just |
@CMCDragonkai This issue is only fixed for the groupby.apply path. The ordinary |
Are you serious! So while groupby apply no longer has duplicate side effects, but normal apply still does!? Shouldn't it be easy to apply this fix to the situation for normal dataframe apply? |
@CMCDragonkai I'd need to check what is causing the duplicate execution to say for sure. My gut tells me that it should be similarly possible. I can not guarantee it without actually checking it. |
Are there any news regarding fixing this for the df.apply? |
@CMCDragonkai this is no way to address a community contributor who has volunteered their time to submit a pull request, please refrain from using such a tone in the future - thank you :) |
The issue of applying a function to the first row twice has been reported quite a few times (issues are usually referenced to #2936) and is also a documented shortcoming/implementation detail (see Notes in docs)
The argumentation is usually that this needs to be done to determine whether or not we can take a fast path for the calculation. Even if we're using the fast path for the calculation, however, the first row is still evaluated twice. I believe with this refactoring it would be possible to only eval the first row once iff the fast path is taken.
For functions with side effects or very expensive calculations in general this may be a big deal as discussed in various other issues.
I'm wondering if this small change may already help out folks. Truth be told, I'm not 100% certain about what's actually happening in this loop regarding the slider and item cache. For the use case where I encountered this, it seems to do the trick, though.
closes #2936
closes #2656
closes #7739
closes #10519
closes #12155
closes #20084
closes #21417
git diff upstream/master -u -- "*.py" | flake8 --diff