Skip to content

groupby + shift drops group columns when as_index is False #13519

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
tjader opened this issue Jun 26, 2016 · 9 comments
Closed

groupby + shift drops group columns when as_index is False #13519

tjader opened this issue Jun 26, 2016 · 9 comments

Comments

@tjader
Copy link

tjader commented Jun 26, 2016

Using groupby + shift seems to have changed behaviour in 0.17 and 0.18 compared to 0.16.
With as_index=False, I would expect the columns that the groupby is made over to remain in the output dataframe, but they are no longer present.

Code Sample, a copy-pastable example if possible

>>>import pandas as pd
>>>import numpy as np
>>>df = pd.DataFrame({'A': [1.0, 1, 1, 2, 2, 3], 'B': [1.0, 2, 3, 4, 5, 6]})
>>>df
     A    B
0  1.0  1.0
1  1.0  2.0
2  1.0  3.0
3  2.0  4.0
4  2.0  5.0
5  3.0  6.0
>>>df_sorted.groupby('A', as_index=False).shift(1)
     B
0  NaN
1  1.0
2  2.0
3  NaN
4  4.0
5  NaN

Expected Output

>>>pd.DataFrame({'A':[np.nan, 1, 1, np.nan, 2, np.nan], 'B':[np.nan, 1, 2, np.nan, 4, np.nan]})
     A    B
0  NaN  NaN
1  1.0  1.0
2  1.0  2.0
3  NaN  NaN
4  2.0  4.0
5  NaN  NaN

output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 23.0.0
Cython: 0.24
numpy: 1.10.4
scipy: 0.17.1
statsmodels: None
xarray: 0.7.2
IPython: None
sphinx: None
patsy: None
dateutil: 2.4.1
pytz: 2016.4
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.5.2
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.40.0
pandas_datareader: None

I have also confirmed the issue on an similar install Linux installation using pandas 0.18.1

@jreback jreback added this to the Next Major Release milestone Jun 27, 2016
@halfmoonhalf
Copy link

halfmoonhalf commented Sep 18, 2017

I found the same issue with groupby.shift and groupby.apply. Depending on different arg, the result is pretty confusing: sometimes the groupby columns is dropped, sometimes it is still a column.
code:

import pandas as pd

df = pd.DataFrame({
    'name': ['a', 'a', 'b', 'b'],
    'col_1': [0, 1, 2, 3],
    'col_2': [4, 5, 6, 7]
})

print df

print '+++++++++++++++++++++++++++++++++++++++++++++++++++'
print '+++++++++++++++++++++++++++++++++++++++++++++++++++'


def f(group):
    return group.ewm(alpha=0.5).mean()


print df.groupby('name', as_index=True).apply(f)
print '-' * 20
print df.groupby('name', as_index=False).apply(f)
print '-' * 20
print df.groupby('name', as_index=True)['col_1'].apply(f)
print '-' * 20
print df.groupby('name', as_index=False)['col_1'].apply(f)

print '+++++++++++++++++++++++++++++++++++++++++++++++++++'
print '+++++++++++++++++++++++++++++++++++++++++++++++++++'

print df.groupby('name', as_index=True).shift(1)
print '-' * 20
print df.groupby('name', as_index=False).shift(1)
print '-' * 20
print df.groupby('name', as_index=True)['col_1'].shift(1)
print '-' * 20
print df.groupby('name', as_index=False)['col_1'].shift(1)
print '-' * 20

output is:

   col_1  col_2 name
0      0      4    a
1      1      5    a
2      2      6    b
3      3      7    b
+++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++++
      col_1     col_2 name  # <<-------------------------------name is still a column?
0  0.000000  4.000000    a
1  0.666667  4.666667    a
2  2.000000  6.000000    b
3  2.666667  6.666667    b
--------------------
      col_1     col_2 name
0  0.000000  4.000000    a
1  0.666667  4.666667    a
2  2.000000  6.000000    b
3  2.666667  6.666667    b
--------------------
0    0.000000  # <<-------------------------------name should be an index?
1    0.666667
2    2.000000
3    2.666667
Name: col_1, dtype: float64
--------------------
0  0    0.000000  # <<-------------------------------multiindex now?
   1    0.666667
1  2    2.000000
   3    2.666667
Name: col_1, dtype: float64
+++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++++
   col_1  col_2
0    NaN    NaN
1    0.0    4.0
2    NaN    NaN
3    2.0    6.0
--------------------
   col_1  col_2
0    NaN    NaN
1    0.0    4.0
2    NaN    NaN
3    2.0    6.0
--------------------
0    NaN
1    0.0
2    NaN
3    2.0
Name: col_1, dtype: float64
--------------------
   col_1  # <<-------------------------------why is this a dataframe?
0    NaN
1    0.0
2    NaN
3    2.0
--------------------

pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Darwin
OS-release: 14.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.20.3
pytest: 2.8.5
pip: 9.0.1
setuptools: 20.3
Cython: 0.23.4
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 4.1.2
sphinx: 1.3.5
patsy: 0.4.0
dateutil: 2.5.1
pytz: 2016.2
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
pandas_gbq: None
pandas_datareader: None

@MrRobot2211
Copy link

I am having the exact same issue. Is tere any workaround?

@vladu
Copy link
Contributor

vladu commented Oct 28, 2019

Ran into the same issue today. There is a workaround. If you reset_index before the groupby, then assign still works correctly after the shift. So, if you're only trying to shift one column, you can do:

frm = frm.assign(shifted_val=frm.groupby('key').shift(1)['val'])

Or, if you're trying to shift the whole frame, you can assign the group col back:

shifted_frm = frm.groupby('key').shift(1)
shifted_frm = shifted_frm.assign(key=frm['key'])

@vladu
Copy link
Contributor

vladu commented Oct 28, 2019

Also, not sure if anything changed since the original report, but now in v0.25, the grouping cols disappear after a groupby + shift regardless of the value of as_index.

@thoughtfuldata
Copy link

has there been any progress on this? I'm getting the same issue as @vladu

@jreback
Copy link
Contributor

jreback commented Mar 16, 2021

@thoughtfuldata you or anyone else can solve this by contributing a pull request

@vladu
Copy link
Contributor

vladu commented Mar 16, 2021

I set up a pandas dev view on my machine this morning and poked around a bit. The underlying issue is actually much more wide-spread than just the group+shift. It appears that all transforms actually drop the grouping columns. You can in fact see this in the documentation:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.transform.html

Note that in the examples section, the frame is grouped by column 'A', and that column appears nowhere in the output. I'll poke around some more, but worried this might have wider ranging consequences than I originally anticipated.

The transforms do preserve the original frame's index, so an easy workaround is to set_index first, then groupby the same columns. But that's inconsistent with agg operations, so I'll see if I can come up with a fix that isn't too extensive.

@mroeschke mroeschke added Bug and removed API Design labels May 1, 2021
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@yanyu2015
Copy link

the grouping cols disappear after a groupby + shift regardless of the value of as_index=True.
Has there been any progress on this?

pandas           : 1.4.4

@rhshadrach
Copy link
Member

This is behaving as expected. shift is a transformation and so as_index has no impact on its output. See the docs here:

https://pandas.pydata.org/docs/user_guide/groupby.html#transformation

A proposal to add functionality so that as_index impacts transforms is here: #49543

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

10 participants