BUG: Groupby.transform with tshift giving incorrect result #32344

Open
ryankarlos opened this issue Feb 29, 2020 · 2 comments
Labels
Bug, Datetime (Datetime data dtype), Groupby

Comments

@ryankarlos
Contributor

ryankarlos commented Feb 29, 2020

I came across this inconsistency whilst trying to write a test for tshift in #32069.

>>> import pandas as pd
>>> df = pd.DataFrame({'A':[121, 121, 121, 121, 231, 231, 676], 'B':[1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0], "C": pd.date_range("2013-11-03", periods=7)})

>>> df
     A    B          C
0  121  1.0 2013-11-03
1  121  2.0 2013-11-04
2  121  2.0 2013-11-05
3  121  3.0 2013-11-06
4  231  3.0 2013-11-07
5  231  3.0 2013-11-08
6  676  4.0 2013-11-09

>>> g = df.set_index("C").groupby("A")

>>> g.transform(lambda x: x.tshift(2, "D"))
              B
C              
2013-11-03  1.0
2013-11-04  2.0
2013-11-05  2.0
2013-11-06  3.0
2013-11-07  3.0
2013-11-08  3.0
2013-11-09  4.0

>>> g.transform("tshift", *[2, "D"])
              B
C              
2013-11-03  1.0
2013-11-04  2.0
2013-11-05  2.0
2013-11-06  3.0
2013-11-07  3.0
2013-11-08  3.0
2013-11-09  4.0

Problem description

Using tshift inside groupby.transform seems to drop A from the index. It also leaves the dates unshifted, as both results above show.

Expected Output

I would expect something like the output below, which groupby.tshift produces correctly:

>>> g.tshift(2, "D")

                  B
A   C              
121 2013-11-05  1.0
    2013-11-06  2.0
    2013-11-07  2.0
    2013-11-08  3.0
231 2013-11-09  3.0
    2013-11-10  3.0
676 2013-11-11  4.0
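
For readers on a pandas release without tshift, the same layout can probably be rebuilt without going through groupby at all. The following is only a sketch under that assumption: it shifts the DatetimeIndex with `shift(..., freq=...)` and then moves A into the index. The names `shifted` and `expected` are illustrative and do not come from the report.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": [121, 121, 121, 121, 231, 231, 676],
        "B": [1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0],
        "C": pd.date_range("2013-11-03", periods=7),
    }
)

# Shift the DatetimeIndex itself by two days; the column values are untouched.
shifted = df.set_index("C").shift(2, freq="D")

# Move the group key into the index so the layout is an (A, C) MultiIndex
# with B as the only column. (Illustrative reshaping, not the groupby code path.)
expected = shifted.reset_index().set_index(["A", "C"])[["B"]]
print(expected)
```

This relies on the rows already being ordered by A, as they are in the example frame; unsorted data would need a `sort_index()` to match the grouped layout.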

Output of pd.show_versions()

INSTALLED VERSIONS

commit : 94befe6
python : 3.7.6.final.0
python-bits : 64
OS : Darwin
OS-release : 19.0.0
machine : x86_64
processor : i386
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8

pandas : 0.25.0.dev0+3348.g94befe6.dirty
numpy : 1.17.3
pytz : 2019.3
dateutil : 2.8.1
pip : 19.3.1
setuptools : 42.0.2.post20191201
Cython : 0.29.14
pytest : 5.3.2
hypothesis : 4.56.3
sphinx : 2.3.1
blosc : None
feather : None
xlsxwriter : 1.2.7
lxml.etree : 4.4.2
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.3
IPython : 7.10.2
pandas_datareader: None
bs4 : 4.8.2
bottleneck : 1.3.1
fastparquet : 0.3.2
gcsfs : None
lxml.etree : 4.4.2
matplotlib : 3.1.2
numexpr : 2.7.0
odfpy : None
openpyxl : 3.0.1
pandas_gbq : None
pyarrow : 0.15.1
pytables : None
pytest : 5.3.2
pyxlsb : None
s3fs : 0.4.0
scipy : 1.4.1
sqlalchemy : 1.3.12
tables : 3.6.1
tabulate : None
xarray : 0.14.1
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.7
numba : 0.46.0

@rhshadrach
Member

While tshift has been removed, this issue still applies to shift. The g.transform("shift", *[2, "D"]) op is now giving the expected output, but g.transform(lambda x: x.tshift(2, "D")) raises.

It appears to me that the freq argument in groupby is unnecessary: there is no difference in the result if one does this on the underlying object (Series or DataFrame). I think we can deprecate that argument in groupby to resolve this issue.
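
The point about freq being group-independent can be checked by hand. The sketch below does not call the groupby freq argument itself (its output layout may differ between pandas versions); instead it shifts each group manually and compares the stitched result with a single shift on the whole frame. The names `indexed`, `whole` and `per_group` are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "A": [121, 121, 121, 121, 231, 231, 676],
        "B": [1.0, 2.0, 2.0, 3.0, 3.0, 3.0, 4.0],
        "C": pd.date_range("2013-11-03", periods=7),
    }
)
indexed = df.set_index("C")

# One shift of the index on the underlying DataFrame.
whole = indexed.shift(2, freq="D")

# The same shift applied group by group, then concatenated in group order.
per_group = pd.concat(
    [group.shift(2, freq="D") for _, group in indexed.groupby("A")]
)

# With freq given, only the index moves, so grouping should not matter.
print(per_group.equals(whole))
```

The comparison uses `DataFrame.equals` so that only values and index labels are compared, not the index's freq attribute.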

cc @mroeschke @MarcoGorelli

@rhshadrach
Member

This is also similar to #23918, cc @jbrockmendel @WillAyd
