BUG: groupby.shift returns different columns when fill_value is specified #41858

smithto1 · 2021-06-07T21:45:10Z

closes BUG: groupby.shift return keys filled with fill_value when fill_value is specified #41556
closes Performance issue while groupby.shift if fill_value specified #26615
tests added / passed
Ensure all linting tests pass, see here for how to run them
whatsnew entry

This fixes the minimal reproducing examples from the original bug report #41556.

With master, specifying fill_value causes the index columns to be returned and to include the fill_value in the grouping columns.

In [4]: import pandas as pd
   ...: 
   ...: df = pd.DataFrame({'a': [2, 1, 2, 1], 'b': ['x', 'x', 'y', 'y']})
   ...: 
   ...: df.groupby('a').shift(1)
Out[4]:
     b
0  NaN
1  NaN
2    x
3    x
In [5]: df.groupby('a').shift(1, fill_value='fill')
Out[5]:
      a     b
0  fill  fill
1  fill  fill
2     2     x
3     1     x

With this PR, only the value columns are returned in both cases, and the fill is applied correctly.

In [1]: import pandas as pd
   ...: 
   ...: # On master, using fill_value causes the index columns to be returned
   ...: df = pd.DataFrame({'a': [2, 1, 2, 1], 'b': ['x', 'x', 'y', 'y']})
   ...: 
   ...: df.groupby('a').shift(1)
Out[1]:
     b
0  NaN
1  NaN
2    x
3    x
In [2]: df.groupby('a').shift(1, fill_value='fill')
Out[2]:
      b
0  fill
1  fill
2     x
3     x

On master, if fill_value=None then _get_cythonized_result was used. But if fill_value was specified, then self.apply was used, because _get_cythonized_result couldn't take the fill_value. I've updated _get_cythonized_result so it can handle the fill_value itself. This means Groupby.shift follows the same code path and returns the same structure of output, whether or not fill_value is specified.

Test
The bug reported in the original issue should have been caught by an existing test, except that the existing test extracts the values columns ( [['Z']] ), before doing its comparison to the expected output. This was hiding the fact that the structure of the result depended on fill_value being specified.

https://github.com/pandas-dev/pandas/blob/1.2.x/pandas/tests/groupby/test_groupby_shift_diff.py#L52

Removing the extraction of the data column ( [['Z']] ) means we now have a test that catches the bug reported in the original issue. (The test fails on master, but passes on this PR).

…path to be used with fill_value!=None. The existing test was extracting only the values column and ignoring that the index columns were also returned, which masked the bug reported in the issue

WillAyd · 2021-06-08T05:45:38Z

Thanks for the PR. I think that dispatching to self.apply here is going to cause a rather large performance regression though - can check the benchmarks to see?

We likely need to patch elsewhere in the groupby code to make ensure that the correct structure is returned

smithto1 · 2021-06-08T06:38:36Z

Thanks for the PR. I think that dispatching to self.apply here is going to cause a rather large performance regression though - can check the benchmarks to see?

We likely need to patch elsewhere in the groupby code to make ensure that the correct structure is returned

@WillAyd, I think there is a misunderstanding here. I have removed the dispatch to self.apply and switched it to _get_cynthonized_result. I haven't checked but assume performance is improved.

Also, the structure of the result no longer depends on whether you specify fill_value or not.

I've updated the original PR description to make the changes clearer.

simonjayhawkins · 2021-06-08T08:02:42Z

@WillAyd, I think there is a misunderstanding here. I have removed the dispatch to self.apply and switched it to _get_cynthonized_result. I haven't checked but would assume performance would improve.

does this close #26615

smithto1 · 2021-06-08T09:40:53Z

does this close #26615

Yes, it does, @simonjayhawkins . Using the example from #26615, they now run at same speed even when fill_value=0 and preserve the dtype of uint8.

In [1]: import numpy as np
   ...: import pandas as pd
   ...: 
   ...: SIZE_MULT = 5
   ...: data = np.random.randint(0, 255, size=10**SIZE_MULT, dtype='uint8')
   ...: index = pd.MultiIndex.from_product(
   ...:             [list(range(10**(SIZE_MULT-1))), list('ABCDEFGHIJ')],
   ...:             names = ['d', 'l'])        
   ...: test = pd.DataFrame(data, index, columns = ['data'])
   ...: print(test.dtypes['data'])
   ...:
uint8

In [2]: %%time
   ...: shifted = test.groupby(axis=0, level=[0]).shift(2)
   ...: print(shifted.dtypes['data'])
   ...:
   ...:
float64
Wall time: 15.6 ms

In [3]: %%time
   ...: shifted = test.groupby(axis=0, level=[0]).shift(2, fill_value = 0)
   ...: print(shifted.dtypes['data'])
   ...:
   ...:
uint8
Wall time: 10.9 ms

WillAyd

Thanks for clarifying. This lgtm

smithto1 · 2021-06-14T07:41:32Z

@WillAyd Is there anything else we need to have this merged in?

jreback

needs tests that reproduce the original issues
release note needs to be update to 1.4
needs review

smithto1 · 2021-06-14T13:40:23Z

needs tests that reproduce the original issues

This PR includes a modification to an existing test which ensures the test recreates the original issue. The modified test fails on master because of the original issue (see below). The test passes with the code changes in this PR.

Previously, the test extracted the value column ( [['Z']] ) from the result, ignoring the fact that the result contained additional columns that should have caused an error. This is fixed now. Fixing the existing test seemed more sensible than adding a new one.

err1429.txt

================================== FAILURES ===================================
______________________ test_group_shift_with_fill_value _______________________

    def test_group_shift_with_fill_value():
        # GH #24128
        n_rows = 24
        df = DataFrame(
            [(i % 12, i % 3, i) for i in range(n_rows)],
            dtype=float,
            columns=["A", "B", "Z"],
            index=None,
        )
        g = df.groupby(["A", "B"])
    
        expected = DataFrame(
            [(i + 12 if i < n_rows - 12 else 0) for i in range(n_rows)],
            dtype=float,
            columns=["Z"],
            index=None,
        )
        result = g.shift(-1, fill_value=0)
    
>       tm.assert_frame_equal(result, expected)
E       AssertionError: DataFrame are different
E       
E       DataFrame shape mismatch
E       [left]:  (24, 3)
E       [right]: (24, 1)

test_groupby_shift_diff.py:60: AssertionError

release note needs to be update to 1.4

Done.

needs review

Ready for review, @jreback

smithto1 · 2021-06-25T10:56:50Z

@jreback can you take another look at this? I think it is ready to merge.

jreback

looks good. can you make sure we have an asv for this case (or add one) and show the results. ping on green.

I think testing is ok, though if you can look in the original issue and make sure we have a replicated test.

pep8speaks · 2021-06-25T23:01:04Z

Hello @smithto1! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-07-28 15:12:50 UTC

smithto1 · 2021-06-25T23:06:22Z

looks good. can you make sure we have an asv for this case (or add one) and show the results. ping on green.

I think testing is ok, though if you can look in the original issue and make sure we have a replicated test.

@jreback please have another look, I think this is good to go.

With this PR, the testing now replicates the original issue. I've looked into it.

asv is added. Below, you can see that on master, using fill_value is slower. But on this PR branch, using fill_value runs in same time.

master

· Discovering benchmarks
· Running 2 total benchmarks (1 commits * 1 environments * 2 benchmarks)
[  0.00%] ·· Benchmarking existing-pyc__anaconda3_envs_pandas-dev_python.exe
[ 25.00%] ··· groupby.Shift.time_defaults                              1.89±0ms
[ 50.00%] ··· groupby.Shift.time_fill_value                            3.47±0ms

PR

· Discovering benchmarks
· Running 2 total benchmarks (1 commits * 1 environments * 2 benchmarks)
[  0.00%] ·· Benchmarking existing-pyc__anaconda3_envs_pandas-dev_python.exe
[ 25.00%] ··· groupby.Shift.time_defaults                              1.33±0ms
[ 50.00%] ··· groupby.Shift.time_fill_value                            1.29±0ms

rhshadrach

lgtm, minor request for the whatsnew

doc/source/whatsnew/v1.4.0.rst

rhshadrach · 2021-07-23T01:59:27Z

@smithto1 - friendly ping for resolving conflict and whatsnew request.

smithto1 · 2021-07-26T10:39:02Z

@smithto1 - friendly ping for resolving conflict and whatsnew request.

That is addressed. I think this is ready to merge.

@rhshadrach @jreback @WillAyd

rhshadrach

lgtm

jreback

lgtm. test coverage question. ping on green.

jreback · 2021-07-28T01:22:26Z

pandas/tests/groupby/test_groupby_shift_diff.py

@@ -55,7 +55,7 @@ def test_group_shift_with_fill_value():
        columns=["Z"],
        index=None,
    )
-    result = g.shift(-1, fill_value=0)[["Z"]]
+    result = g.shift(-1, fill_value=0)


do we have sufficient coverage of the tests from the OP?

Yes, @jreback. I have investigated it.

The original problem was that the grouping columns were improperly returned. The old form of this test would extract the value column ([["Z"]]), so the test did not detect that the grouping columns were returned. I have modified this test so it no longer extracts the value column; if the grouping columns are returned (the OP) this test will now fail.

kk great

if u can resolve conflicts and merge master

@jreback conflicts resolved. I think it can be merged.

jreback · 2021-07-28T22:55:27Z

thanks @smithto1 for the patch and the patience! keep em coming!

…fied (pandas-dev#41858)

* TST: Fix doctests for pandas.io.formats.style * Modified: pandas/io/formats/style.py * Added some expected results * Skipped some tests * TST: Add link to redirect to Table Visualization user guide * Modified style.py * Updated the doctest of the apply() * Updated the doctest of the applymap() * Updated the doctest of the set_table_styles() * Updated the doctest of the set_properties() * TST: Add image to pipe function result * Modified style.py * Updated the doctest of the pipe() * TST: Remove unnecessary outputs * Modified pandas/io/formats/style.py * Updated the doctests of the set_tooltips() * Updated the doctests of the to_latex() * Updated the doctests of the set_td_classes() * Updated the doctests of the set_table_attributes() * TST: Add the output to the Styler.format doctest in to_latex() * REG: DataFrame.agg where func returns lists and axis=1 (#42762) * Fix typing issues for CI (#42770) * BUG: groupby.shift returns different columns when fill_value is specified (#41858) * PERF: extract_array earlier in DataFrame construction (#42774) * ENH: `sparse_columns` and `sparse_index` added to `Styler.to_html` (#41946) * TYP: Fix typing for searchsorted (#42788) * DOC GH42756 Update documentation for pandas.DataFrame.drop to clarify tuples. (#42789) * CI: Fix doctests (#42790) * REGR: nanosecond timestamp comparisons to OOB datetimes (#42796) * COMPAT: MPL 3.4.0 (#42803) * Delete duplicates and unused code from reshape tests (#42802) * REGR: ValueError raised when both prefix and names are set to None (#42690) * REGR: ValueError raised when both prefix and names are set to None * Update readers.py * whitespace * Update v1.3.1.rst * Update v1.3.2.rst * Update readers.py * Update readers.py Co-authored-by: Jeff Reback <[email protected]> * TST: Add style.py to the doctest check * TST: fixed eng_formatter doctest for #42671 (#42705) * TST: Revert x and y position in some doctests * Updated the doctest of the hide_columns() Co-authored-by: Richard Shadrach <[email protected]> Co-authored-by: Irv Lustig <[email protected]> Co-authored-by: Thomas Smith <[email protected]> Co-authored-by: jbrockmendel <[email protected]> Co-authored-by: attack68 <[email protected]> Co-authored-by: Mike Phung <[email protected]> Co-authored-by: Matthew Zeitlin <[email protected]> Co-authored-by: Thomas Li <[email protected]> Co-authored-by: Patrick Hoefler <[email protected]> Co-authored-by: Jeff Reback <[email protected]> Co-authored-by: Krishna Chivukula <[email protected]>

…fied (pandas-dev#41858)

* TST: Fix doctests for pandas.io.formats.style * Modified: pandas/io/formats/style.py * Added some expected results * Skipped some tests * TST: Add link to redirect to Table Visualization user guide * Modified style.py * Updated the doctest of the apply() * Updated the doctest of the applymap() * Updated the doctest of the set_table_styles() * Updated the doctest of the set_properties() * TST: Add image to pipe function result * Modified style.py * Updated the doctest of the pipe() * TST: Remove unnecessary outputs * Modified pandas/io/formats/style.py * Updated the doctests of the set_tooltips() * Updated the doctests of the to_latex() * Updated the doctests of the set_td_classes() * Updated the doctests of the set_table_attributes() * TST: Add the output to the Styler.format doctest in to_latex() * REG: DataFrame.agg where func returns lists and axis=1 (pandas-dev#42762) * Fix typing issues for CI (pandas-dev#42770) * BUG: groupby.shift returns different columns when fill_value is specified (pandas-dev#41858) * PERF: extract_array earlier in DataFrame construction (pandas-dev#42774) * ENH: `sparse_columns` and `sparse_index` added to `Styler.to_html` (pandas-dev#41946) * TYP: Fix typing for searchsorted (pandas-dev#42788) * DOC GH42756 Update documentation for pandas.DataFrame.drop to clarify tuples. (pandas-dev#42789) * CI: Fix doctests (pandas-dev#42790) * REGR: nanosecond timestamp comparisons to OOB datetimes (pandas-dev#42796) * COMPAT: MPL 3.4.0 (pandas-dev#42803) * Delete duplicates and unused code from reshape tests (pandas-dev#42802) * REGR: ValueError raised when both prefix and names are set to None (pandas-dev#42690) * REGR: ValueError raised when both prefix and names are set to None * Update readers.py * whitespace * Update v1.3.1.rst * Update v1.3.2.rst * Update readers.py * Update readers.py Co-authored-by: Jeff Reback <[email protected]> * TST: Add style.py to the doctest check * TST: fixed eng_formatter doctest for pandas-dev#42671 (pandas-dev#42705) * TST: Revert x and y position in some doctests * Updated the doctest of the hide_columns() Co-authored-by: Richard Shadrach <[email protected]> Co-authored-by: Irv Lustig <[email protected]> Co-authored-by: Thomas Smith <[email protected]> Co-authored-by: jbrockmendel <[email protected]> Co-authored-by: attack68 <[email protected]> Co-authored-by: Mike Phung <[email protected]> Co-authored-by: Matthew Zeitlin <[email protected]> Co-authored-by: Thomas Li <[email protected]> Co-authored-by: Patrick Hoefler <[email protected]> Co-authored-by: Jeff Reback <[email protected]> Co-authored-by: Krishna Chivukula <[email protected]>

smithto1 added 4 commits June 7, 2021 22:09

pandas-dev#41556 passing fill_value into take_nd() allows cythonized …

cf445fd

…path to be used with fill_value!=None. The existing test was extracting only the values column and ignoring that the index columns were also returned, which masked the bug reported in the issue

Merge remote-tracking branch 'upstream/master' into issue41556

af4763d

whatsnew

232e317

black

dfc758d

WillAyd added the Groupby label Jun 8, 2021

WillAyd approved these changes Jun 10, 2021

View reviewed changes

simonjayhawkins added the Performance Memory or execution speed performance label Jun 10, 2021

Merge remote-tracking branch 'upstream/master' into issue41556

6c23adb

jreback requested changes Jun 14, 2021

View reviewed changes

whatsnew 1.4

e7236ac

Merge remote-tracking branch 'upstream/master' into issue41556

006048c

smithto1 requested a review from jreback June 25, 2021 10:56

jreback added this to the 1.4 milestone Jun 25, 2021

jreback requested changes Jun 25, 2021

View reviewed changes

add ASV

5f9478f

black

674b3cb

rhshadrach requested changes Jul 8, 2021

View reviewed changes

doc/source/whatsnew/v1.4.0.rst Outdated Show resolved Hide resolved

whatsnew conflict

c92ca78

rhshadrach approved these changes Jul 27, 2021

View reviewed changes

fix whatsnew conflict

60cb771

jreback requested changes Jul 28, 2021

View reviewed changes

whatsnew conflict

85afab0

jreback approved these changes Jul 28, 2021

View reviewed changes

jreback merged commit 94a7418 into pandas-dev:master Jul 28, 2021

CGe0516 pushed a commit to CGe0516/pandas that referenced this pull request Jul 29, 2021

BUG: groupby.shift returns different columns when fill_value is speci…

da61298

…fied (pandas-dev#41858)

Leonardofreua pushed a commit to Leonardofreua/pandas that referenced this pull request Jul 30, 2021

BUG: groupby.shift returns different columns when fill_value is speci…

5ba05f5

…fied (pandas-dev#41858)

feefladder pushed a commit to feefladder/pandas that referenced this pull request Sep 7, 2021

BUG: groupby.shift returns different columns when fill_value is speci…

b7ce336

…fied (pandas-dev#41858)

Uh oh!

BUG: groupby.shift returns different columns when fill_value is specified #41858

BUG: groupby.shift returns different columns when fill_value is specified #41858

Uh oh!

Conversation

smithto1 commented Jun 7, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

WillAyd commented Jun 8, 2021

Uh oh!

smithto1 commented Jun 8, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

simonjayhawkins commented Jun 8, 2021

Uh oh!

smithto1 commented Jun 8, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

WillAyd left a comment

Choose a reason for hiding this comment

Uh oh!

smithto1 commented Jun 14, 2021

Uh oh!

jreback left a comment

Choose a reason for hiding this comment

Uh oh!

smithto1 commented Jun 14, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

smithto1 commented Jun 25, 2021

Uh oh!

jreback left a comment

Choose a reason for hiding this comment

Uh oh!

pep8speaks commented Jun 25, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Comment last updated at 2021-07-28 15:12:50 UTC

Uh oh!

smithto1 commented Jun 25, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

rhshadrach left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

rhshadrach commented Jul 23, 2021

Uh oh!

smithto1 commented Jul 26, 2021

Uh oh!

rhshadrach left a comment

Choose a reason for hiding this comment

Uh oh!

jreback left a comment

Choose a reason for hiding this comment

Uh oh!

jreback Jul 28, 2021

Choose a reason for hiding this comment

Uh oh!

smithto1 Jul 28, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jreback Jul 28, 2021

Choose a reason for hiding this comment

Uh oh!

smithto1 Jul 28, 2021

Choose a reason for hiding this comment

Uh oh!

jreback commented Jul 28, 2021

Uh oh!

Uh oh!

smithto1 commented Jun 7, 2021 •

edited

Loading

smithto1 commented Jun 8, 2021 •

edited

Loading

smithto1 commented Jun 8, 2021 •

edited

Loading

smithto1 commented Jun 14, 2021 •

edited

Loading

pep8speaks commented Jun 25, 2021 •

edited

Loading

smithto1 commented Jun 25, 2021 •

edited

Loading

smithto1 Jul 28, 2021 •

edited

Loading