ENH: Only apply first group once in fast GroupBy.apply #24748

fjetter · 2019-01-13T08:53:12Z

The issue of applying a function to the first row twice has been reported quite a few times (issues are usually referenced to #2936) and is also a documented shortcoming/implementation detail (see Notes in docs)

The argumentation is usually that this needs to be done to determine whether or not we can take a fast path for the calculation. Even if we're using the fast path for the calculation, however, the first row is still evaluated twice. I believe with this refactoring it would be possible to only eval the first row once iff the fast path is taken.
For functions with side effects or very expensive calculations in general this may be a big deal as discussed in various other issues.

I'm wondering if this small change may already help out folks. Truth be told, I'm not 100% certain about what's actually happening in this loop regarding the slider and item cache. For the use case where I encountered this, it seems to do the trick, though.

closes #2936
closes #2656
closes #7739
closes #10519
closes #12155
closes #20084
closes #21417

tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

codecov · 2019-01-13T09:31:41Z

Codecov Report

Merging #24748 into master will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master   #24748   +/-   ##
=======================================
  Coverage   92.38%   92.38%           
=======================================
  Files         166      166           
  Lines       52358    52358           
=======================================
  Hits        48373    48373           
  Misses       3985     3985

Flag	Coverage Δ
#multiple	`90.81% <ø> (ø)`	⬆️
#single	`43.07% <ø> (ø)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 33f91d8...0102842. Read the comment docs.

codecov · 2019-01-13T09:31:41Z

Codecov Report

Merging #24748 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #24748      +/-   ##
==========================================
+ Coverage   91.48%   91.48%   +<.01%     
==========================================
  Files         175      175              
  Lines       52881    52885       +4     
==========================================
+ Hits        48376    48380       +4     
  Misses       4505     4505

Flag	Coverage Δ
#multiple	`90.04% <100%> (ø)`	⬆️
#single	`41.82% <0%> (-0.01%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/groupby/ops.py	`94.21% <100%> (+0.05%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 51c6a05...ecb7294. Read the comment docs.

jreback

I am not sure you actually need to change, this rather you can modify code in pandas/core/groupby/generic.py and pandas/core/groupby/ops.py, search for fast_path.

but much more important is to actually have some tests for this.

fjetter · 2019-01-14T09:15:16Z

I don't believe that changing the python code is actually sufficient since the Cython code I am modifying evaluates the first row twice, even if everything works as expected. My change here is essentially only using the calculation for the first row/test if the test is successful instead of recalculating it again.

I believe you are referring to

pandas/pandas/core/groupby/ops.py

Line 174 in 453fa85

values, mutated = splitter.fast_apply(f, group_keys)

where fast_apply(defined here) is called and if successful, the result (calculated in Cython) is immediately returned. If an InvalidApply exception is caught, a slow, python-only path is taken. I'm not trying to optimize the slow/python path but I am arguing that even the fast_apply path will evaluate the first row twice although this isn't necessary.

Apart from implicit tests (this code section should be called at many places) there is one explicit test to ensure that the fast path can be taken and the flags for mutated are returned as expected, see

pandas/pandas/tests/groupby/test_apply.py

Line 81 in 453fa85

def test_fast_apply():

If there are other edge cases you can think of I can add more tests.

jreback

@fjetter this is not going to be acceptable w/o a test that fails before and passes now, meaning you have to show an explicit test that does this.

pandas/tests/groupby/test_groupby.py

jreback · 2019-01-16T02:01:48Z

@fjetter I would like to collect all tests from any of the references issues. After you add them see if this closes them; list the issues in the top of the PR.

pep8speaks · 2019-01-26T13:00:58Z

Hello @fjetter! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-03-26 03:47:26 UTC

fjetter · 2019-01-26T13:28:46Z

@jreback Sorry for the delay. I went through the referenced issues and looked at the code examples the people where posting. Mostly the examples were pretty close to each other and I classified them into two categories (see below). Essentially most reported tickets were suggesting a very simple DataFrame.groupby.apply using a function which returns a scalar. This triggers the fast path which is what my change should fix. There was only a single issue where the example triggered the slow path computation by applying the identity. (Type B: When using the identity there is a check in place which infers that the group may have been mutated which then triggers the fallback to the slow path)

I added a few more tests to make the current behavior more explicit, i.e. fast path applies row/group once but slow path still does it twice.
Although, I do have some ideas on how to improve the situation for the slow path; this would require larger changes and I do not see this happening in the scope of this PR. I believe improving the fast path alone is already offering value to the users.

Type A1 : Groupy DataFrame on col name; apply doesn't mutate (fast path)
Type A2 : Groupy DataFrame on group keys; apply doesn't mutate (fast path)
Type A3 : Groupy DataFrame on multiple names; apply doesn't mutate (fast path)

Type B: Apply DataFrame; apply identity

#2936 - Type A1 print / return new df
#2656 - Type A2 print / return len
#6753 - Type B - print / identity; slow path
#7739 - Type A1 - print / return 0
#10222 - Question
#10519 - Type A1 - print / return None
#12155 - Type A1 - print / return None
#16946 - Question
#20084 - Type A1 - return group.describe
#21417 - Type A3 - print / return None

jreback

pls add a comment with the issue number on the tests (if the test applies to multiple issue, then list both of them).

Add a whatsnew entry in 0.25.0. I would make this a sub-section in api-breaking changes, showing a mini-example and listing all of the closed issues.

jreback · 2019-01-26T15:16:06Z

pandas/tests/frame/test_apply.py

+        (1, [0, 1, 2, 3, 4, 5]),
+    ])
+    def test_apply_first_row_once(self, axis, expected):
+        df = pd.DataFrame({'a': [0, 0, 1, 1, 2, 2], 'b': np.arange(6)})


can you add the issue number as a comment

jreback · 2019-01-26T15:16:25Z

pandas/tests/frame/test_apply.py

+            rows.append(row.name)
+            return 0
+        df.apply(f_fast, axis=axis)
+        # gh-2936


ahh you have here, well move to the top

jreback · 2019-01-26T15:20:50Z

also we have a warning in the groupby.rst about this, can remove it.

fjetter · 2019-01-27T10:56:13Z

I added a whatsnew entry and added the issue references to the tests. I kept the warning in groupby.rst since the double func evaluation may still occur if we need to fall back to the slow path.

jreback · 2019-01-27T12:12:28Z

I added a whatsnew entry and added the issue references to the tests. I kept the warning in groupby.rst since the double func evaluation may still occur if we need to fall back to the slow path.

that needs to be addressed here as well - i suspect this is actually the more common problem

fjetter · 2019-01-27T16:03:16Z

I changed the implementation such that also the slow path evaluates the function only once. I refactored apply_frame_axis0 to signal the sucess/validity of the fast apply via a status code instead of an exception. If the status code is 1, we fall back to the slow path calculation but will use the already calculated result for the first group.

jreback · 2019-01-27T16:12:03Z

pandas/tests/groupby/test_groupby.py

@@ -1420,20 +1420,30 @@ def foo(x):

 def test_group_name_available_in_inference_pass():
    # gh-15062
+    # GH24748 ,GH2936, GH2656, GH7739, GH10519, GH12155, GH20084, GH21417


don’t add all the issues, just the ones that apply to each test; then it should be easy to see that we have all of the issues covered

I'll remove the links from this test since this test has been there before and is supposed to test whether the group name is available. The connection to this issue is merely coincidental.

For the (very similar) test in groupby/test_apply.py I added all these ticket numbers since you asked me to go through the issues and collect the tests. They are all duplicates and provide only very little additional information. All of them are linked to #2936 Is it ok if I remove all duplicates and only keep that one? same question for the whatsnew: list all duplicates or just the one reference issue?

doc/source/whatsnew/v0.25.0.rst

jreback · 2019-01-27T16:12:51Z

doc/source/whatsnew/v0.25.0.rst

+
+.. code-block:: ipython
+
+    In [2]: df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})


define df in a separate block above and show it

jreback · 2019-01-27T17:09:50Z

i am ok with slightly duplicative tests if they are testing subtle behavior - if they are exact duplicates then ok to combine

fjetter · 2019-01-27T17:13:57Z

The test_group_name_available_in_inference_pass is now a subset of test_group_apply_once_per_group in groupby/test_apply.py. Although the implementation is very similar I'd keep them both since they test different behavior

jreback · 2019-02-02T23:24:38Z

doc/source/whatsnew/v0.25.0.rst

+    assert side_effects == ["x", "y"]
+
+
+(:issue:`2936`, :issue:`2656`, :issue:`7739`, :issue:`10519`, :issue:`12155`,


put the issue numbers at the end of the top paragraph.

jreback · 2019-02-02T23:25:56Z

doc/source/whatsnew/v0.25.0.rst

+GroupBy.apply on ``DataFrame`` evaluates first group only once
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The implementation of ``DataFrame.groupby.apply`` previously evaluated func


see if you can make this a proper reference, I think:

:meth:`DataFrameGroupBy.apply() <pandas.core.groupby.DataFrameGroupBy.apply>`

should work

jreback · 2019-02-02T23:26:33Z

doc/source/whatsnew/v0.25.0.rst

+
+.. code-block:: ipython
+
+    In [6]: assert side_effects == ["x", "x", "y"]


just print side_effects (and it will show the 3 values)

jreback · 2019-02-02T23:27:40Z

doc/source/whatsnew/v0.25.0.rst

+
+.. ipython:: python
+
+    assert side_effects == ["x", "y"]


same here, just show side_effects

jreback · 2019-02-02T23:29:53Z

pandas/_libs/reduction.pyx

                    piece = piece.copy(deep='all')
                else:
                    mutated = True
            except AttributeError:
                pass

            results.append(piece)
+            if status > 0:


this notion of status is pretty opaque, can you add some comments and/or make this much more obvious what you are doing.

pandas/tests/groupby/test_apply.py

jreback · 2019-02-21T13:59:21Z

doc/source/whatsnew/v0.25.0.rst

+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The implementation of :meth:`DataFrameGroupBy.apply() <pandas.core.groupby.DataFrameGroupBy.apply>`
+previously evaluated func consistently twice on the first group to infer if it


func -> the supplied function

jreback · 2019-02-21T13:59:52Z

doc/source/whatsnew/v0.25.0.rst

+is safe to use a fast code path. Particularly for functions with side effects,
+this was an undesired behavior and may have led to surprises.
+
+Now every group is evaluated only a single time.


put the issues list in parens.

it is affter the sentence above.

jreback · 2019-02-21T14:00:53Z

doc/source/whatsnew/v0.25.0.rst

+        return group
+    df.groupby("a").apply(func)
+
+Previous behavior:


use *Previous Behaviour* (not the *) and similar on New

pandas/_libs/reduction.pyx

jreback · 2019-02-21T14:02:00Z

pandas/_libs/reduction.pyx

                    piece = piece.copy(deep='all')
                else:
                    mutated = True
            except AttributeError:
                pass

            results.append(piece)
+            if not successfull_fast_apply:


add a comment here on why you are returning early

pandas/core/groupby/ops.py

pandas/tests/groupby/test_apply.py

jreback · 2019-03-26T00:39:42Z

rebased & whatsnew updated.

jreback · 2019-03-26T01:02:30Z

@WillAyd if you have any more comments.

WillAyd · 2019-03-26T01:43:51Z

Thanks Jeff I’ll take a look tonight and merge if good

…

Sent from my iPhone

On Mar 25, 2019, at 6:02 PM, Jeff Reback ***@***.***> wrote: @WillAyd if you have any more comments. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

…irst_row_once

WillAyd · 2019-03-26T03:48:05Z

CI failure looked unrelated. Merged master will merge on green

WillAyd · 2019-03-26T04:38:42Z

Thanks @fjetter !

Latest version of pandas changed the behaviour of groupby and does not call it multiple times. I believe it's related to pandas-dev/pandas#24748.

CMCDragonkai · 2019-09-13T08:01:27Z

Is this true only for groupby.apply? I tried test with just apply on a standard dataframe, and I found that the applied function is still being executed twice for the first row. This is quite problematic for me as my applied function is a side-effect.

fjetter · 2019-09-17T09:46:43Z

@CMCDragonkai This issue is only fixed for the groupby.apply path. The ordinary DataFrame.apply uses different code paths where maybe a similar issue is buried.

CMCDragonkai · 2019-09-17T09:50:33Z

Are you serious! So while groupby apply no longer has duplicate side effects, but normal apply still does!? Shouldn't it be easy to apply this fix to the situation for normal dataframe apply?

fjetter · 2019-09-17T10:03:26Z

@CMCDragonkai I'd need to check what is causing the duplicate execution to say for sure. My gut tells me that it should be similarly possible. I can not guarantee it without actually checking it.

alonme · 2020-04-28T07:41:21Z

Are there any news regarding fixing this for the df.apply?

MarcoGorelli · 2020-05-02T09:38:12Z

Are you serious! So while groupby apply no longer has duplicate side effects, but normal apply still does!?

@CMCDragonkai this is no way to address a community contributor who has volunteered their time to submit a pull request, please refrain from using such a tone in the future - thank you :)

fjetter force-pushed the PERF/only_apply_first_row_once branch from 0102842 to 7354f02 Compare January 13, 2019 09:38

jreback requested changes Jan 13, 2019

View reviewed changes

jreback added the Groupby label Jan 13, 2019

jreback requested changes Jan 14, 2019

View reviewed changes

fjetter commented Jan 15, 2019

View reviewed changes

pandas/tests/groupby/test_groupby.py Show resolved Hide resolved

fjetter force-pushed the PERF/only_apply_first_row_once branch 2 times, most recently from 9afa234 to 1c7eb85 Compare January 26, 2019 14:21

jreback requested changes Jan 26, 2019

View reviewed changes

fjetter force-pushed the PERF/only_apply_first_row_once branch from 1c7eb85 to b9507a2 Compare January 27, 2019 10:47

fjetter changed the title ~~ENH / RFC Only apply first row once in fast apply~~ ENH Only apply first group once in fast GroupBy.apply Jan 27, 2019

fjetter changed the title ~~ENH Only apply first group once in fast GroupBy.apply~~ ENH: Only apply first group once in fast GroupBy.apply Jan 27, 2019

fjetter force-pushed the PERF/only_apply_first_row_once branch from b9507a2 to fcf8455 Compare January 27, 2019 10:52

fjetter force-pushed the PERF/only_apply_first_row_once branch from fcf8455 to ab2c65d Compare January 27, 2019 15:59

jreback requested changes Jan 27, 2019

View reviewed changes

jreback requested changes Feb 2, 2019

View reviewed changes

fjetter force-pushed the PERF/only_apply_first_row_once branch from 55a8509 to 25fa482 Compare February 21, 2019 13:56

jreback requested changes Feb 21, 2019

View reviewed changes

jreback added this to the 0.25.0 milestone Feb 24, 2019

jreback added 2 commits March 25, 2019 20:31

Merge branch 'master' into PR_TOOL_MERGE_PR_24748

f57ce64

use printing in example

feb39eb

jreback added 3 commits March 25, 2019 20:40

fix whatsnew

67a2909

reformat tests

2f36c30

small clean in reduction.pyx

b52f206

jreback approved these changes Mar 26, 2019

View reviewed changes

Merge remote-tracking branch 'upstream/master' into PERF/only_apply_f…

ecb7294

…irst_row_once

WillAyd approved these changes Mar 26, 2019

View reviewed changes

WillAyd merged commit 324bb84 into pandas-dev:master Mar 26, 2019

WillAyd mentioned this pull request Mar 26, 2019

Performance regression in 0.24+ on GroupBy.apply #25883

Closed

jreback mentioned this pull request Mar 28, 2019

groupby-apply got different output with for loop in same action #25906

Closed

WillAyd mentioned this pull request Apr 4, 2019

GroupBy.apply repeats the first group N times when slicing and sorting in the applied function #25892

Closed

TomAugspurger mentioned this pull request Aug 2, 2019

pandas.DataFrame.apply runs twice if an exception is raised #27697

Closed

elmotec added a commit to elmotec/codemetrics that referenced this pull request Sep 3, 2019

Fixed test_core.py

a5ca360

Latest version of pandas changed the behaviour of groupby and does not call it multiple times. I believe it's related to pandas-dev/pandas#24748.

Aretno mentioned this pull request Feb 26, 2020

Add Beersheba - the deconvolution city next-exp/IC#666

Merged

alonme mentioned this pull request Apr 29, 2020

DOC: Fix Dataframe.apply documentation to include note regarding calling function twice on first element #33879

Closed

WillAyd mentioned this pull request May 6, 2020

Fix Dataframe.apply documentation to include note regarding calling f… #33880

Closed

5 tasks

WillAyd mentioned this pull request May 20, 2020

BUG/ENH: Fix apply to only call func once on the first column/row #34183

Merged

4 tasks

Rohith295 mentioned this pull request Jun 20, 2020

TST: groupby apply called multiple times #34897

Merged

3 tasks


		.. code-block:: ipython

		In [2]: df = pd.DataFrame({"a": ["x", "y"], "b": [1, 2]})

		assert side_effects == ["x", "y"]


		(:issue:`2936`, :issue:`2656`, :issue:`7739`, :issue:`10519`, :issue:`12155`,


		.. code-block:: ipython

		In [6]: assert side_effects == ["x", "x", "y"]

ENH: Only apply first group once in fast GroupBy.apply #24748

ENH: Only apply first group once in fast GroupBy.apply #24748

Conversation

fjetter commented Jan 13, 2019 • edited Loading

codecov bot commented Jan 13, 2019

Codecov Report

codecov bot commented Jan 13, 2019 • edited Loading

Codecov Report

jreback left a comment

Choose a reason for hiding this comment

fjetter commented Jan 14, 2019 • edited Loading

jreback left a comment

Choose a reason for hiding this comment

jreback commented Jan 16, 2019

pep8speaks commented Jan 26, 2019 • edited Loading

Comment last updated at 2019-03-26 03:47:26 UTC

fjetter commented Jan 26, 2019

jreback left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback commented Jan 26, 2019

fjetter commented Jan 27, 2019

jreback commented Jan 27, 2019

fjetter commented Jan 27, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback commented Jan 27, 2019

fjetter commented Jan 27, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jreback commented Mar 26, 2019

jreback commented Mar 26, 2019

WillAyd commented Mar 26, 2019 via email

WillAyd commented Mar 26, 2019

WillAyd commented Mar 26, 2019

CMCDragonkai commented Sep 13, 2019

fjetter commented Sep 17, 2019

CMCDragonkai commented Sep 17, 2019

fjetter commented Sep 17, 2019

alonme commented Apr 28, 2020

MarcoGorelli commented May 2, 2020

fjetter commented Jan 13, 2019 •

edited

Loading

codecov bot commented Jan 13, 2019 •

edited

Loading

fjetter commented Jan 14, 2019 •

edited

Loading

pep8speaks commented Jan 26, 2019 •

edited

Loading