Skip to content

BUG: groupby.shift returns different columns when fill_value is specified #41858

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Jul 28, 2021

Conversation

smithto1
Copy link
Member

@smithto1 smithto1 commented Jun 7, 2021

This fixes the minimal reproducing examples from the original bug report #41556.

With master, specifying fill_value causes the index columns to be returned and to include the fill_value in the grouping columns.

In [4]: import pandas as pd
   ...: 
   ...: df = pd.DataFrame({'a': [2, 1, 2, 1], 'b': ['x', 'x', 'y', 'y']})
   ...: 
   ...: df.groupby('a').shift(1)
Out[4]:
     b
0  NaN
1  NaN
2    x
3    x
In [5]: df.groupby('a').shift(1, fill_value='fill')
Out[5]:
      a     b
0  fill  fill
1  fill  fill
2     2     x
3     1     x

With this PR, only the value columns are returned in both cases, and the fill is applied correctly.

In [1]: import pandas as pd
   ...: 
   ...: # On master, using fill_value causes the index columns to be returned
   ...: df = pd.DataFrame({'a': [2, 1, 2, 1], 'b': ['x', 'x', 'y', 'y']})
   ...: 
   ...: df.groupby('a').shift(1)
Out[1]:
     b
0  NaN
1  NaN
2    x
3    x
In [2]: df.groupby('a').shift(1, fill_value='fill')
Out[2]:
      b
0  fill
1  fill
2     x
3     x

On master, if fill_value=None then _get_cythonized_result was used. But if fill_value was specified, then self.apply was used, because _get_cythonized_result couldn't take the fill_value. I've updated _get_cythonized_result so it can handle the fill_value itself. This means Groupby.shift follows the same code path and returns the same structure of output, whether or not fill_value is specified.

Test
The bug reported in the original issue should have been caught by an existing test, except that the existing test extracts the values columns ( [['Z']] ), before doing its comparison to the expected output. This was hiding the fact that the structure of the result depended on fill_value being specified.

https://github.com/pandas-dev/pandas/blob/1.2.x/pandas/tests/groupby/test_groupby_shift_diff.py#L52

Removing the extraction of the data column ( [['Z']] ) means we now have a test that catches the bug reported in the original issue. (The test fails on master, but passes on this PR).

smithto1 added 4 commits June 7, 2021 22:09
…path to be used with fill_value!=None. The existing test was extracting only the values column and ignoring that the index columns were also returned, which masked the bug reported in the issue
@WillAyd
Copy link
Member

WillAyd commented Jun 8, 2021

Thanks for the PR. I think that dispatching to self.apply here is going to cause a rather large performance regression though - can check the benchmarks to see?

We likely need to patch elsewhere in the groupby code to make ensure that the correct structure is returned

@WillAyd WillAyd added the Groupby label Jun 8, 2021
@smithto1
Copy link
Member Author

smithto1 commented Jun 8, 2021

Thanks for the PR. I think that dispatching to self.apply here is going to cause a rather large performance regression though - can check the benchmarks to see?

We likely need to patch elsewhere in the groupby code to make ensure that the correct structure is returned

@WillAyd, I think there is a misunderstanding here. I have removed the dispatch to self.apply and switched it to _get_cynthonized_result. I haven't checked but assume performance is improved.

Also, the structure of the result no longer depends on whether you specify fill_value or not.

I've updated the original PR description to make the changes clearer.

@simonjayhawkins
Copy link
Member

@WillAyd, I think there is a misunderstanding here. I have removed the dispatch to self.apply and switched it to _get_cynthonized_result. I haven't checked but would assume performance would improve.

does this close #26615

@smithto1
Copy link
Member Author

smithto1 commented Jun 8, 2021

does this close #26615

Yes, it does, @simonjayhawkins . Using the example from #26615, they now run at same speed even when fill_value=0 and preserve the dtype of uint8.

In [1]: import numpy as np
   ...: import pandas as pd
   ...: 
   ...: SIZE_MULT = 5
   ...: data = np.random.randint(0, 255, size=10**SIZE_MULT, dtype='uint8')
   ...: index = pd.MultiIndex.from_product(
   ...:             [list(range(10**(SIZE_MULT-1))), list('ABCDEFGHIJ')],
   ...:             names = ['d', 'l'])        
   ...: test = pd.DataFrame(data, index, columns = ['data'])
   ...: print(test.dtypes['data'])
   ...:
uint8

In [2]: %%time
   ...: shifted = test.groupby(axis=0, level=[0]).shift(2)
   ...: print(shifted.dtypes['data'])
   ...:
   ...:
float64
Wall time: 15.6 ms

In [3]: %%time
   ...: shifted = test.groupby(axis=0, level=[0]).shift(2, fill_value = 0)
   ...: print(shifted.dtypes['data'])
   ...:
   ...:
uint8
Wall time: 10.9 ms

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifying. This lgtm

@simonjayhawkins simonjayhawkins added the Performance Memory or execution speed performance label Jun 10, 2021
@smithto1
Copy link
Member Author

@WillAyd Is there anything else we need to have this merged in?

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs tests that reproduce the original issues
release note needs to be update to 1.4
needs review

@smithto1
Copy link
Member Author

smithto1 commented Jun 14, 2021

needs tests that reproduce the original issues

This PR includes a modification to an existing test which ensures the test recreates the original issue. The modified test fails on master because of the original issue (see below). The test passes with the code changes in this PR.

Previously, the test extracted the value column ( [['Z']] ) from the result, ignoring the fact that the result contained additional columns that should have caused an error. This is fixed now. Fixing the existing test seemed more sensible than adding a new one.

err1429.txt

================================== FAILURES ===================================
______________________ test_group_shift_with_fill_value _______________________

    def test_group_shift_with_fill_value():
        # GH #24128
        n_rows = 24
        df = DataFrame(
            [(i % 12, i % 3, i) for i in range(n_rows)],
            dtype=float,
            columns=["A", "B", "Z"],
            index=None,
        )
        g = df.groupby(["A", "B"])
    
        expected = DataFrame(
            [(i + 12 if i < n_rows - 12 else 0) for i in range(n_rows)],
            dtype=float,
            columns=["Z"],
            index=None,
        )
        result = g.shift(-1, fill_value=0)
    
>       tm.assert_frame_equal(result, expected)
E       AssertionError: DataFrame are different
E       
E       DataFrame shape mismatch
E       [left]:  (24, 3)
E       [right]: (24, 1)

test_groupby_shift_diff.py:60: AssertionError

release note needs to be update to 1.4

Done.

needs review

Ready for review, @jreback

@smithto1 smithto1 requested a review from jreback June 25, 2021 10:56
@smithto1
Copy link
Member Author

@jreback can you take another look at this? I think it is ready to merge.

@jreback jreback added this to the 1.4 milestone Jun 25, 2021
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good. can you make sure we have an asv for this case (or add one) and show the results. ping on green.

I think testing is ok, though if you can look in the original issue and make sure we have a replicated test.

@pep8speaks
Copy link

pep8speaks commented Jun 25, 2021

Hello @smithto1! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-07-28 15:12:50 UTC

@smithto1
Copy link
Member Author

smithto1 commented Jun 25, 2021

looks good. can you make sure we have an asv for this case (or add one) and show the results. ping on green.

I think testing is ok, though if you can look in the original issue and make sure we have a replicated test.

@jreback please have another look, I think this is good to go.

With this PR, the testing now replicates the original issue. I've looked into it.

asv is added. Below, you can see that on master, using fill_value is slower. But on this PR branch, using fill_value runs in same time.

master

· Discovering benchmarks
· Running 2 total benchmarks (1 commits * 1 environments * 2 benchmarks)
[  0.00%] ·· Benchmarking existing-pyc__anaconda3_envs_pandas-dev_python.exe
[ 25.00%] ··· groupby.Shift.time_defaults                              1.89±0ms
[ 50.00%] ··· groupby.Shift.time_fill_value                            3.47±0ms

PR

· Discovering benchmarks
· Running 2 total benchmarks (1 commits * 1 environments * 2 benchmarks)
[  0.00%] ·· Benchmarking existing-pyc__anaconda3_envs_pandas-dev_python.exe
[ 25.00%] ··· groupby.Shift.time_defaults                              1.33±0ms
[ 50.00%] ··· groupby.Shift.time_fill_value                            1.29±0ms

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, minor request for the whatsnew

@rhshadrach
Copy link
Member

@smithto1 - friendly ping for resolving conflict and whatsnew request.

@smithto1
Copy link
Member Author

@smithto1 - friendly ping for resolving conflict and whatsnew request.

That is addressed. I think this is ready to merge.

@rhshadrach @jreback @WillAyd

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. test coverage question. ping on green.

@@ -55,7 +55,7 @@ def test_group_shift_with_fill_value():
columns=["Z"],
index=None,
)
result = g.shift(-1, fill_value=0)[["Z"]]
result = g.shift(-1, fill_value=0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have sufficient coverage of the tests from the OP?

Copy link
Member Author

@smithto1 smithto1 Jul 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, @jreback. I have investigated it.

The original problem was that the grouping columns were improperly returned. The old form of this test would extract the value column ([["Z"]]), so the test did not detect that the grouping columns were returned. I have modified this test so it no longer extracts the value column; if the grouping columns are returned (the OP) this test will now fail.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kk great

if u can resolve conflicts and merge master

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback conflicts resolved. I think it can be merged.

@jreback jreback merged commit 94a7418 into pandas-dev:master Jul 28, 2021
@jreback
Copy link
Contributor

jreback commented Jul 28, 2021

thanks @smithto1 for the patch and the patience! keep em coming!

CGe0516 pushed a commit to CGe0516/pandas that referenced this pull request Jul 29, 2021
Leonardofreua pushed a commit to Leonardofreua/pandas that referenced this pull request Jul 30, 2021
attack68 added a commit that referenced this pull request Jul 31, 2021
* TST: Fix doctests for pandas.io.formats.style

* Modified: pandas/io/formats/style.py

* Added some expected results

* Skipped some tests

* TST: Add link to redirect to Table Visualization user guide

* Modified style.py

* Updated the doctest of the apply()

* Updated the doctest of the applymap()

* Updated the doctest of the set_table_styles()

* Updated the doctest of the set_properties()

* TST: Add image to pipe function result

* Modified style.py

* Updated the doctest of the pipe()

* TST: Remove unnecessary outputs

* Modified pandas/io/formats/style.py

* Updated the doctests of the set_tooltips()

* Updated the doctests of the to_latex()

* Updated the doctests of the set_td_classes()

* Updated the doctests of the set_table_attributes()

* TST: Add the output to the Styler.format doctest in to_latex()

* REG: DataFrame.agg where func returns lists and axis=1 (#42762)

* Fix typing issues for CI (#42770)

* BUG: groupby.shift returns different columns when fill_value is specified (#41858)

* PERF: extract_array earlier in DataFrame construction (#42774)

* ENH: `sparse_columns` and `sparse_index` added to `Styler.to_html`  (#41946)

* TYP: Fix typing for searchsorted (#42788)

* DOC GH42756 Update documentation for pandas.DataFrame.drop to clarify tuples. (#42789)

* CI: Fix doctests (#42790)

* REGR: nanosecond timestamp comparisons to OOB datetimes (#42796)

* COMPAT: MPL 3.4.0 (#42803)

* Delete duplicates and unused code from reshape tests (#42802)

* REGR: ValueError raised when both prefix and names are set to None (#42690)

* REGR: ValueError raised when both prefix and names are set to None

* Update readers.py

* whitespace

* Update v1.3.1.rst

* Update v1.3.2.rst

* Update readers.py

* Update readers.py

Co-authored-by: Jeff Reback <[email protected]>

* TST: Add style.py to the doctest check

* TST: fixed eng_formatter doctest for #42671 (#42705)

* TST: Revert x and y position in some doctests

* Updated the doctest of the hide_columns()

Co-authored-by: Richard Shadrach <[email protected]>
Co-authored-by: Irv Lustig <[email protected]>
Co-authored-by: Thomas Smith <[email protected]>
Co-authored-by: jbrockmendel <[email protected]>
Co-authored-by: attack68 <[email protected]>
Co-authored-by: Mike Phung <[email protected]>
Co-authored-by: Matthew Zeitlin <[email protected]>
Co-authored-by: Thomas Li <[email protected]>
Co-authored-by: Patrick Hoefler <[email protected]>
Co-authored-by: Jeff Reback <[email protected]>
Co-authored-by: Krishna Chivukula <[email protected]>
feefladder pushed a commit to feefladder/pandas that referenced this pull request Sep 7, 2021
feefladder pushed a commit to feefladder/pandas that referenced this pull request Sep 7, 2021
* TST: Fix doctests for pandas.io.formats.style

* Modified: pandas/io/formats/style.py

* Added some expected results

* Skipped some tests

* TST: Add link to redirect to Table Visualization user guide

* Modified style.py

* Updated the doctest of the apply()

* Updated the doctest of the applymap()

* Updated the doctest of the set_table_styles()

* Updated the doctest of the set_properties()

* TST: Add image to pipe function result

* Modified style.py

* Updated the doctest of the pipe()

* TST: Remove unnecessary outputs

* Modified pandas/io/formats/style.py

* Updated the doctests of the set_tooltips()

* Updated the doctests of the to_latex()

* Updated the doctests of the set_td_classes()

* Updated the doctests of the set_table_attributes()

* TST: Add the output to the Styler.format doctest in to_latex()

* REG: DataFrame.agg where func returns lists and axis=1 (pandas-dev#42762)

* Fix typing issues for CI (pandas-dev#42770)

* BUG: groupby.shift returns different columns when fill_value is specified (pandas-dev#41858)

* PERF: extract_array earlier in DataFrame construction (pandas-dev#42774)

* ENH: `sparse_columns` and `sparse_index` added to `Styler.to_html`  (pandas-dev#41946)

* TYP: Fix typing for searchsorted (pandas-dev#42788)

* DOC GH42756 Update documentation for pandas.DataFrame.drop to clarify tuples. (pandas-dev#42789)

* CI: Fix doctests (pandas-dev#42790)

* REGR: nanosecond timestamp comparisons to OOB datetimes (pandas-dev#42796)

* COMPAT: MPL 3.4.0 (pandas-dev#42803)

* Delete duplicates and unused code from reshape tests (pandas-dev#42802)

* REGR: ValueError raised when both prefix and names are set to None (pandas-dev#42690)

* REGR: ValueError raised when both prefix and names are set to None

* Update readers.py

* whitespace

* Update v1.3.1.rst

* Update v1.3.2.rst

* Update readers.py

* Update readers.py

Co-authored-by: Jeff Reback <[email protected]>

* TST: Add style.py to the doctest check

* TST: fixed eng_formatter doctest for pandas-dev#42671 (pandas-dev#42705)

* TST: Revert x and y position in some doctests

* Updated the doctest of the hide_columns()

Co-authored-by: Richard Shadrach <[email protected]>
Co-authored-by: Irv Lustig <[email protected]>
Co-authored-by: Thomas Smith <[email protected]>
Co-authored-by: jbrockmendel <[email protected]>
Co-authored-by: attack68 <[email protected]>
Co-authored-by: Mike Phung <[email protected]>
Co-authored-by: Matthew Zeitlin <[email protected]>
Co-authored-by: Thomas Li <[email protected]>
Co-authored-by: Patrick Hoefler <[email protected]>
Co-authored-by: Jeff Reback <[email protected]>
Co-authored-by: Krishna Chivukula <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Groupby Performance Memory or execution speed performance
Projects
None yet
6 participants