Skip to content

ENH: Only apply first group once in fast GroupBy.apply #24748

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 11 commits into from
Mar 26, 2019

Conversation

fjetter
Copy link
Member

@fjetter fjetter commented Jan 13, 2019

The issue of applying a function to the first row twice has been reported quite a few times (issues are usually referenced to #2936) and is also a documented shortcoming/implementation detail (see Notes in docs)

The argumentation is usually that this needs to be done to determine whether or not we can take a fast path for the calculation. Even if we're using the fast path for the calculation, however, the first row is still evaluated twice. I believe with this refactoring it would be possible to only eval the first row once iff the fast path is taken.
For functions with side effects or very expensive calculations in general this may be a big deal as discussed in various other issues.

I'm wondering if this small change may already help out folks. Truth be told, I'm not 100% certain about what's actually happening in this loop regarding the slider and item cache. For the use case where I encountered this, it seems to do the trick, though.

closes #2936
closes #2656
closes #7739
closes #10519
closes #12155
closes #20084
closes #21417

  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

@codecov
Copy link

codecov bot commented Jan 13, 2019

Codecov Report

Merging #24748 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master   #24748   +/-   ##
=======================================
  Coverage   92.38%   92.38%           
=======================================
  Files         166      166           
  Lines       52358    52358           
=======================================
  Hits        48373    48373           
  Misses       3985     3985
Flag Coverage Δ
#multiple 90.81% <ø> (ø) ⬆️
#single 43.07% <ø> (ø) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 33f91d8...0102842. Read the comment docs.

@codecov
Copy link

codecov bot commented Jan 13, 2019

Codecov Report

Merging #24748 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #24748      +/-   ##
==========================================
+ Coverage   91.48%   91.48%   +<.01%     
==========================================
  Files         175      175              
  Lines       52881    52885       +4     
==========================================
+ Hits        48376    48380       +4     
  Misses       4505     4505
Flag Coverage Δ
#multiple 90.04% <100%> (ø) ⬆️
#single 41.82% <0%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/groupby/ops.py 94.21% <100%> (+0.05%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 51c6a05...ecb7294. Read the comment docs.

@fjetter fjetter force-pushed the PERF/only_apply_first_row_once branch from 0102842 to 7354f02 Compare January 13, 2019 09:38
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure you actually need to change, this rather you can modify code in pandas/core/groupby/generic.py and pandas/core/groupby/ops.py, search for fast_path.

but much more important is to actually have some tests for this.

@fjetter
Copy link
Member Author

fjetter commented Jan 14, 2019

I don't believe that changing the python code is actually sufficient since the Cython code I am modifying evaluates the first row twice, even if everything works as expected. My change here is essentially only using the calculation for the first row/test if the test is successful instead of recalculating it again.

I believe you are referring to

values, mutated = splitter.fast_apply(f, group_keys)

where fast_apply(defined here) is called and if successful, the result (calculated in Cython) is immediately returned. If an InvalidApply exception is caught, a slow, python-only path is taken. I'm not trying to optimize the slow/python path but I am arguing that even the fast_apply path will evaluate the first row twice although this isn't necessary.

Apart from implicit tests (this code section should be called at many places) there is one explicit test to ensure that the fast path can be taken and the flags for mutated are returned as expected, see

def test_fast_apply():

If there are other edge cases you can think of I can add more tests.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@fjetter this is not going to be acceptable w/o a test that fails before and passes now, meaning you have to show an explicit test that does this.

@jreback
Copy link
Contributor

jreback commented Jan 16, 2019

@fjetter I would like to collect all tests from any of the references issues. After you add them see if this closes them; list the issues in the top of the PR.

@pep8speaks
Copy link

pep8speaks commented Jan 26, 2019

Hello @fjetter! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-03-26 03:47:26 UTC

@fjetter
Copy link
Member Author

fjetter commented Jan 26, 2019

@jreback Sorry for the delay. I went through the referenced issues and looked at the code examples the people where posting. Mostly the examples were pretty close to each other and I classified them into two categories (see below). Essentially most reported tickets were suggesting a very simple DataFrame.groupby.apply using a function which returns a scalar. This triggers the fast path which is what my change should fix. There was only a single issue where the example triggered the slow path computation by applying the identity. (Type B: When using the identity there is a check in place which infers that the group may have been mutated which then triggers the fallback to the slow path)

I added a few more tests to make the current behavior more explicit, i.e. fast path applies row/group once but slow path still does it twice.
Although, I do have some ideas on how to improve the situation for the slow path; this would require larger changes and I do not see this happening in the scope of this PR. I believe improving the fast path alone is already offering value to the users.

Type A1 : Groupy DataFrame on col name; apply doesn't mutate (fast path)
Type A2 : Groupy DataFrame on group keys; apply doesn't mutate (fast path)
Type A3 : Groupy DataFrame on multiple names; apply doesn't mutate (fast path)

Type B: Apply DataFrame; apply identity

#2936 - Type A1 print / return new df
#2656 - Type A2 print / return len
#6753 - Type B - print / identity; slow path
#7739 - Type A1 - print / return 0
#10222 - Question
#10519 - Type A1 - print / return None
#12155 - Type A1 - print / return None
#16946 - Question
#20084 - Type A1 - return group.describe
#21417 - Type A3 - print / return None

@fjetter fjetter force-pushed the PERF/only_apply_first_row_once branch 2 times, most recently from 9afa234 to 1c7eb85 Compare January 26, 2019 14:21
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pls add a comment with the issue number on the tests (if the test applies to multiple issue, then list both of them).

Add a whatsnew entry in 0.25.0. I would make this a sub-section in api-breaking changes, showing a mini-example and listing all of the closed issues.

(1, [0, 1, 2, 3, 4, 5]),
])
def test_apply_first_row_once(self, axis, expected):
df = pd.DataFrame({'a': [0, 0, 1, 1, 2, 2], 'b': np.arange(6)})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add the issue number as a comment

rows.append(row.name)
return 0
df.apply(f_fast, axis=axis)
# gh-2936
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ahh you have here, well move to the top

@jreback
Copy link
Contributor

jreback commented Jan 26, 2019

also we have a warning in the groupby.rst about this, can remove it.

@fjetter fjetter force-pushed the PERF/only_apply_first_row_once branch from 1c7eb85 to b9507a2 Compare January 27, 2019 10:47
@fjetter fjetter changed the title ENH / RFC Only apply first row once in fast apply ENH Only apply first group once in fast GroupBy.apply Jan 27, 2019
@fjetter fjetter changed the title ENH Only apply first group once in fast GroupBy.apply ENH: Only apply first group once in fast GroupBy.apply Jan 27, 2019
@fjetter fjetter force-pushed the PERF/only_apply_first_row_once branch from b9507a2 to fcf8455 Compare January 27, 2019 10:52
@fjetter
Copy link
Member Author

fjetter commented Jan 27, 2019

I added a whatsnew entry and added the issue references to the tests. I kept the warning in groupby.rst since the double func evaluation may still occur if we need to fall back to the slow path.

@jreback
Copy link
Contributor

jreback commented Jan 27, 2019

I added a whatsnew entry and added the issue references to the tests. I kept the warning in groupby.rst since the double func evaluation may still occur if we need to fall back to the slow path.

that needs to be addressed here as well - i suspect this is actually the more common problem

@fjetter fjetter force-pushed the PERF/only_apply_first_row_once branch from fcf8455 to ab2c65d Compare January 27, 2019 15:59
@fjetter
Copy link
Member Author

fjetter commented Jan 27, 2019

I changed the implementation such that also the slow path evaluates the function only once. I refactored apply_frame_axis0 to signal the sucess/validity of the fast apply via a status code instead of an exception. If the status code is 1, we fall back to the slow path calculation but will use the already calculated result for the first group.

@@ -1420,20 +1420,30 @@ def foo(x):

def test_group_name_available_in_inference_pass():
# gh-15062
# GH24748 ,GH2936, GH2656, GH7739, GH10519, GH12155, GH20084, GH21417
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don’t add all the issues, just the ones that apply to each test; then it should be easy to see that we have all of the issues covered

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll remove the links from this test since this test has been there before and is supposed to test whether the group name is available. The connection to this issue is merely coincidental.

For the (very similar) test in groupby/test_apply.py I added all these ticket numbers since you asked me to go through the issues and collect the tests. They are all duplicates and provide only very little additional information. All of them are linked to #2936 Is it ok if I remove all duplicates and only keep that one? same question for the whatsnew: list all duplicates or just the one reference issue?


.. code-block:: ipython

In [2]: df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

define df in a separate block above and show it

@jreback
Copy link
Contributor

jreback commented Jan 27, 2019

i am ok with slightly duplicative tests if they are testing subtle behavior - if they are exact duplicates then ok to combine

@fjetter
Copy link
Member Author

fjetter commented Jan 27, 2019

The test_group_name_available_in_inference_pass is now a subset of test_group_apply_once_per_group in groupby/test_apply.py. Although the implementation is very similar I'd keep them both since they test different behavior

assert side_effects == ["x", "y"]


(:issue:`2936`, :issue:`2656`, :issue:`7739`, :issue:`10519`, :issue:`12155`,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put the issue numbers at the end of the top paragraph.

GroupBy.apply on ``DataFrame`` evaluates first group only once
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The implementation of ``DataFrame.groupby.apply`` previously evaluated func
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see if you can make this a proper reference, I think:

:meth:`DataFrameGroupBy.apply() <pandas.core.groupby.DataFrameGroupBy.apply>`

should work


.. code-block:: ipython

In [6]: assert side_effects == ["x", "x", "y"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just print side_effects (and it will show the 3 values)


.. ipython:: python

assert side_effects == ["x", "y"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here, just show side_effects

piece = piece.copy(deep='all')
else:
mutated = True
except AttributeError:
pass

results.append(piece)
if status > 0:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this notion of status is pretty opaque, can you add some comments and/or make this much more obvious what you are doing.

@fjetter fjetter force-pushed the PERF/only_apply_first_row_once branch from 55a8509 to 25fa482 Compare February 21, 2019 13:56
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The implementation of :meth:`DataFrameGroupBy.apply() <pandas.core.groupby.DataFrameGroupBy.apply>`
previously evaluated func consistently twice on the first group to infer if it
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

func -> the supplied function

is safe to use a fast code path. Particularly for functions with side effects,
this was an undesired behavior and may have led to surprises.

Now every group is evaluated only a single time.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put the issues list in parens.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is affter the sentence above.

return group
df.groupby("a").apply(func)

Previous behavior:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use *Previous Behaviour* (not the *) and similar on New

piece = piece.copy(deep='all')
else:
mutated = True
except AttributeError:
pass

results.append(piece)
if not successfull_fast_apply:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a comment here on why you are returning early

@jreback jreback added this to the 0.25.0 milestone Feb 24, 2019
@jreback
Copy link
Contributor

jreback commented Mar 26, 2019

rebased & whatsnew updated.

@jreback
Copy link
Contributor

jreback commented Mar 26, 2019

@WillAyd if you have any more comments.

@WillAyd
Copy link
Member

WillAyd commented Mar 26, 2019 via email

@WillAyd
Copy link
Member

WillAyd commented Mar 26, 2019

CI failure looked unrelated. Merged master will merge on green

@WillAyd WillAyd merged commit 324bb84 into pandas-dev:master Mar 26, 2019
@WillAyd
Copy link
Member

WillAyd commented Mar 26, 2019

Thanks @fjetter !

elmotec added a commit to elmotec/codemetrics that referenced this pull request Sep 3, 2019
Latest version of pandas changed the behaviour of groupby and does not call it multiple times.

I believe it's related to pandas-dev/pandas#24748.
@CMCDragonkai
Copy link

Is this true only for groupby.apply? I tried test with just apply on a standard dataframe, and I found that the applied function is still being executed twice for the first row. This is quite problematic for me as my applied function is a side-effect.

@fjetter
Copy link
Member Author

fjetter commented Sep 17, 2019

@CMCDragonkai This issue is only fixed for the groupby.apply path. The ordinary DataFrame.apply uses different code paths where maybe a similar issue is buried.

@CMCDragonkai
Copy link

Are you serious! So while groupby apply no longer has duplicate side effects, but normal apply still does!? Shouldn't it be easy to apply this fix to the situation for normal dataframe apply?

@fjetter
Copy link
Member Author

fjetter commented Sep 17, 2019

@CMCDragonkai I'd need to check what is causing the duplicate execution to say for sure. My gut tells me that it should be similarly possible. I can not guarantee it without actually checking it.

@alonme
Copy link
Contributor

alonme commented Apr 28, 2020

Are there any news regarding fixing this for the df.apply?

@MarcoGorelli
Copy link
Member

Are you serious! So while groupby apply no longer has duplicate side effects, but normal apply still does!?

@CMCDragonkai this is no way to address a community contributor who has volunteered their time to submit a pull request, please refrain from using such a tone in the future - thank you :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment