BUG: groupby.tshift inconsistent behavior with other groupby transformations #34452

Closed
fujiaxiang opened this issue May 29, 2020 · 3 comments
Labels: Deprecate (functionality to remove in pandas), Groupby

Comments

@fujiaxiang
Member

I discovered this while trying to tackle issue #32344, where @ryankarlos mentioned groupby.transform('tshift', ...) seems to behave incorrectly.

However, before we can address #32344, we probably need to address this.

# on current master
>>> import pandas as pd
>>> import numpy as np

>>> pd.__version__
'1.1.0.dev0+1708.g043b60920'

>>> df = pd.DataFrame(
...     {
...         "A": ["foo", "foo", "foo", "foo", "bar", "bar", "baz"],
...         "B": [1, 2, np.nan, 3, 3, np.nan, 4],
...     },
...     index=pd.date_range('2020-01-01', '2020-01-07')
... )
>>> df
              A    B
2020-01-01  foo  1.0
2020-01-02  foo  2.0
2020-01-03  foo  NaN
2020-01-04  foo  3.0
2020-01-05  bar  3.0
2020-01-06  bar  NaN
2020-01-07  baz  4.0

>>> df.groupby("A").tshift(1, "D")
                  B
A
bar 2020-01-06  3.0
    2020-01-07  NaN
baz 2020-01-08  4.0
foo 2020-01-02  1.0
    2020-01-03  2.0
    2020-01-04  NaN
    2020-01-05  3.0

>>> df.groupby("A").ffill()
              B
2020-01-01  1.0
2020-01-02  2.0
2020-01-03  2.0
2020-01-04  3.0
2020-01-05  3.0
2020-01-06  3.0
2020-01-07  4.0

>>> df.groupby("A").cumsum()
              B
2020-01-01  1.0
2020-01-02  3.0
2020-01-03  NaN
2020-01-04  6.0
2020-01-05  3.0
2020-01-06  NaN
2020-01-07  4.0

We can see that groupby.tshift is inconsistent with other groupby transformations: it retains the groupby column (as an index level) and, more importantly, reorders the data by group.
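
To make the comparison concrete, here is a minimal sketch that checks the two properties the other transformations share: the result keeps the original index order and does not include the grouping column. The variable names (ffilled, summed, checks) are only illustrative, and the checks assume a pandas version at or after 0.25.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["foo", "foo", "foo", "foo", "bar", "bar", "baz"],
        "B": [1, 2, np.nan, 3, 3, np.nan, 4],
    },
    index=pd.date_range("2020-01-01", "2020-01-07"),
)

ffilled = df.groupby("A").ffill()
summed = df.groupby("A").cumsum()

# For each transformation, record whether the original index is preserved
# and whether the grouping column "A" has been left out of the result.
checks = {
    name: (result.index.equals(df.index), list(result.columns) == ["B"])
    for name, result in {"ffill": ffilled, "cumsum": summed}.items()
}
print(checks)
```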

Since 0.25 there has been a deliberate effort to make all groupby transformations consistent; see https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.25.0.html#dataframe-groupby-ffill-bfill-no-longer-return-group-labels

Following this thinking, I would expect the returned data to look more like this:

>>> df.groupby("A").tshift(1, "D")  # this is actually the result of df.tshift(1, "D").drop(columns='A')
              B
2020-01-02  1.0
2020-01-03  2.0
2020-01-04  NaN
2020-01-05  3.0
2020-01-06  3.0
2020-01-07  NaN
2020-01-08  4.0

However, if we make groupby.tshift consistent with the other groupby transformations as above, it becomes no different from df.tshift(1, "D").drop(columns='A'), and groupby loses its meaning here.
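
The reason grouping adds nothing is that a frequency-based shift moves every timestamp by the same offset, independently of the other rows. A minimal sketch of that argument follows. It uses DataFrame.shift(1, freq="D") as a stand-in for tshift (an assumption made so the example also runs on pandas releases where tshift is no longer available), and the names ungrouped, pieces, and grouped are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["foo", "foo", "foo", "foo", "bar", "bar", "baz"],
        "B": [1, 2, np.nan, 3, 3, np.nan, 4],
    },
    index=pd.date_range("2020-01-01", "2020-01-07"),
)

# Ungrouped: move every timestamp forward by one day, then drop the group column.
ungrouped = df.shift(1, freq="D").drop(columns="A")

# Group-wise: shift each group on its own, then restore chronological order.
pieces = [group[["B"]].shift(1, freq="D") for _, group in df.groupby("A")]
grouped = pd.concat(pieces).sort_index()

# DataFrame.equals treats NaNs in the same positions as equal.
same = grouped.equals(ungrouped)
print(same)
```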

Perhaps we should just deprecate groupby.tshift entirely? I know #11631 discussed deprecating tshift, but that has been stalled for a long time.

@fujiaxiang added the Bug and Needs Triage (issue that has not been reviewed by a pandas team member) labels on May 29, 2020
@WillAyd
Member

WillAyd commented May 29, 2020

Unless there's a usability gap between shift and tshift, I think it's OK to deprecate the latter.

@WillAyd added the API Design and Deprecate (functionality to remove in pandas) labels and removed the Bug and Needs Triage (issue that has not been reviewed by a pandas team member) labels on May 29, 2020
@jreback
Contributor

jreback commented Jun 15, 2020

@fujiaxiang we should just deprecate tshift on .groupby and call it a day.

@mroeschke
Member

Looks like tshift has been deprecated so closing as a won't fix:

In [3]: df = pd.DataFrame(
   ...:     {
   ...:         "A": ["foo", "foo", "foo", "foo", "bar", "bar", "baz"],
   ...:         "B": [1, 2, np.nan, 3, 3, np.nan, 4],
   ...:     },
   ...:     index=pd.date_range('2020-01-01', '2020-01-07')
   ...: )

In [4]: df.groupby("A").tshift(1, "D")
pandas/core/groupby/groupby.py:934: FutureWarning: tshift is deprecated and will be removed in a future version. Please use shift instead.
  return f(x, *args, **kwargs)
Out[4]:
                  B
A
bar 2020-01-06  3.0
    2020-01-07  NaN
baz 2020-01-08  4.0
foo 2020-01-02  1.0
    2020-01-03  2.0
    2020-01-04  NaN
    2020-01-05  3.0
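
For anyone who relied on the grouped layout shown above, the following is a rough sketch of one way to rebuild it with shift, as the warning suggests. It is not an official migration recipe: the approach (shifting each group with freq="D" and concatenating with the group key as an outer index level) and the names pieces and result are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "A": ["foo", "foo", "foo", "foo", "bar", "bar", "baz"],
        "B": [1, 2, np.nan, 3, 3, np.nan, 4],
    },
    index=pd.date_range("2020-01-01", "2020-01-07"),
)

# Shift each group's timestamps by one day; groupby iterates keys in sorted order.
pieces = {key: group[["B"]].shift(1, freq="D") for key, group in df.groupby("A")}

# Concatenating a dict uses its keys as the outer index level, named "A" here.
result = pd.concat(pieces, names=["A"])
print(result)
```

If the group labels are not needed, df.shift(1, freq="D").drop(columns="A") gives the flat, chronologically ordered frame directly.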
