Skip to content

BUG: groupby.shift returns different columns when fill_value is specified #41858

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 12 commits into from
Jul 28, 2021
12 changes: 12 additions & 0 deletions asv_bench/benchmarks/groupby.py
Original file line number Diff line number Diff line change
Expand Up @@ -369,6 +369,18 @@ def time_category_size(self):
self.draws.groupby(self.cats).size()


class Shift:
def setup(self):
N = 18
self.df = DataFrame({"g": ["a", "b"] * 9, "v": list(range(N))})

def time_defaults(self):
self.df.groupby("g").shift()

def time_fill_value(self):
self.df.groupby("g").shift(fill_value=99)


class FillNA:
def setup(self):
N = 100
Expand Down
2 changes: 2 additions & 0 deletions doc/source/whatsnew/v1.4.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -166,6 +166,7 @@ Performance improvements
~~~~~~~~~~~~~~~~~~~~~~~~
- Performance improvement in :meth:`.GroupBy.sample`, especially when ``weights`` argument provided (:issue:`34483`)
- Performance improvement in :meth:`.GroupBy.transform` for user-defined functions (:issue:`41598`)
- Performance improvement in :meth:`GroupBy.shift` when ``fill_value`` argument is provided (:issue:`26615`)

.. ---------------------------------------------------------------------------

Expand Down Expand Up @@ -260,6 +261,7 @@ Groupby/resample/rolling
^^^^^^^^^^^^^^^^^^^^^^^^
- Fixed bug in :meth:`SeriesGroupBy.apply` where passing an unrecognized string argument failed to raise ``TypeError`` when the underlying ``Series`` is empty (:issue:`42021`)
- Bug in :meth:`Series.rolling.apply`, :meth:`DataFrame.rolling.apply`, :meth:`Series.expanding.apply` and :meth:`DataFrame.expanding.apply` with ``engine="numba"`` where ``*args`` were being cached with the user passed function (:issue:`42287`)
- Bug in :meth:`GroupBy.shift` that would return the grouping columns if ``fill_value`` was not None (:issue:`41556`)
-

Reshaping
Expand Down
8 changes: 6 additions & 2 deletions pandas/core/groupby/groupby.py
Original file line number Diff line number Diff line change
Expand Up @@ -2822,6 +2822,7 @@ def _get_cythonized_result(
result_is_index: bool = False,
pre_processing=None,
post_processing=None,
fill_value=None,
**kwargs,
):
"""
Expand Down Expand Up @@ -2872,6 +2873,8 @@ def _get_cythonized_result(
second argument, i.e. the signature should be
(ndarray, Type). If `needs_nullable=True`, a third argument should be
`nullable`, to allow for processing specific to nullable values.
fill_value : any, default None
The scalar value to use for newly introduced missing values.
**kwargs : dict
Extra arguments to be passed back to Cython funcs

Expand Down Expand Up @@ -2970,7 +2973,7 @@ def _get_cythonized_result(
result = result.reshape(-1)

if result_is_index:
result = algorithms.take_nd(values, result)
result = algorithms.take_nd(values, result, fill_value=fill_value)

if post_processing:
result = post_processing(result, inferences)
Expand Down Expand Up @@ -3017,7 +3020,7 @@ def shift(self, periods=1, freq=None, axis=0, fill_value=None):
tshift : Shift the time index, using the index’s frequency
if available.
"""
if freq is not None or axis != 0 or not isna(fill_value):
if freq is not None or axis != 0:
return self.apply(lambda x: x.shift(periods, freq, axis, fill_value))

return self._get_cythonized_result(
Expand All @@ -3027,6 +3030,7 @@ def shift(self, periods=1, freq=None, axis=0, fill_value=None):
needs_ngroups=True,
result_is_index=True,
periods=periods,
fill_value=fill_value,
)

@final
Expand Down
2 changes: 1 addition & 1 deletion pandas/tests/groupby/test_groupby_shift_diff.py
Original file line number Diff line number Diff line change
Expand Up @@ -55,7 +55,7 @@ def test_group_shift_with_fill_value():
columns=["Z"],
index=None,
)
result = g.shift(-1, fill_value=0)[["Z"]]
result = g.shift(-1, fill_value=0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have sufficient coverage of the tests from the OP?

Copy link
Member Author

@smithto1 smithto1 Jul 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, @jreback. I have investigated it.

The original problem was that the grouping columns were improperly returned. The old form of this test would extract the value column ([["Z"]]), so the test did not detect that the grouping columns were returned. I have modified this test so it no longer extracts the value column; if the grouping columns are returned (the OP) this test will now fail.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

kk great

if u can resolve conflicts and merge master

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback conflicts resolved. I think it can be merged.


tm.assert_frame_equal(result, expected)

Expand Down