groupby + shift drops group columns when as_index is False #13519

tjader · 2016-06-26T21:23:52Z

Using groupby + shift seems to have changed behaviour in 0.17 and 0.18 compared to 0.16.
With as_index=False, I would expect the columns that the groupby is made over to remain in the output dataframe, but they are no longer present.

Code Sample, a copy-pastable example if possible

>>>import pandas as pd
>>>import numpy as np
>>>df = pd.DataFrame({'A': [1.0, 1, 1, 2, 2, 3], 'B': [1.0, 2, 3, 4, 5, 6]})
>>>df
     A    B
0  1.0  1.0
1  1.0  2.0
2  1.0  3.0
3  2.0  4.0
4  2.0  5.0
5  3.0  6.0
>>>df_sorted.groupby('A', as_index=False).shift(1)
     B
0  NaN
1  1.0
2  2.0
3  NaN
4  4.0
5  NaN

Expected Output

>>>pd.DataFrame({'A':[np.nan, 1, 1, np.nan, 2, np.nan], 'B':[np.nan, 1, 2, np.nan, 4, np.nan]})
     A    B
0  NaN  NaN
1  1.0  1.0
2  1.0  2.0
3  NaN  NaN
4  2.0  4.0
5  NaN  NaN

output of `pd.show_versions()`

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.11.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 58 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.18.1
nose: 1.3.7
pip: 8.1.2
setuptools: 23.0.0
Cython: 0.24
numpy: 1.10.4
scipy: 0.17.1
statsmodels: None
xarray: 0.7.2
IPython: None
sphinx: None
patsy: None
dateutil: 2.4.1
pytz: 2016.4
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.5.2
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
boto: 2.40.0
pandas_datareader: None

I have also confirmed the issue on an similar install Linux installation using pandas 0.18.1

The text was updated successfully, but these errors were encountered:

halfmoonhalf · 2017-09-18T02:07:27Z

I found the same issue with groupby.shift and groupby.apply. Depending on different arg, the result is pretty confusing: sometimes the groupby columns is dropped, sometimes it is still a column.
code:

import pandas as pd

df = pd.DataFrame({
    'name': ['a', 'a', 'b', 'b'],
    'col_1': [0, 1, 2, 3],
    'col_2': [4, 5, 6, 7]
})

print df

print '+++++++++++++++++++++++++++++++++++++++++++++++++++'
print '+++++++++++++++++++++++++++++++++++++++++++++++++++'


def f(group):
    return group.ewm(alpha=0.5).mean()


print df.groupby('name', as_index=True).apply(f)
print '-' * 20
print df.groupby('name', as_index=False).apply(f)
print '-' * 20
print df.groupby('name', as_index=True)['col_1'].apply(f)
print '-' * 20
print df.groupby('name', as_index=False)['col_1'].apply(f)

print '+++++++++++++++++++++++++++++++++++++++++++++++++++'
print '+++++++++++++++++++++++++++++++++++++++++++++++++++'

print df.groupby('name', as_index=True).shift(1)
print '-' * 20
print df.groupby('name', as_index=False).shift(1)
print '-' * 20
print df.groupby('name', as_index=True)['col_1'].shift(1)
print '-' * 20
print df.groupby('name', as_index=False)['col_1'].shift(1)
print '-' * 20

output is:

   col_1  col_2 name
0      0      4    a
1      1      5    a
2      2      6    b
3      3      7    b
+++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++++
      col_1     col_2 name  # <<-------------------------------name is still a column?
0  0.000000  4.000000    a
1  0.666667  4.666667    a
2  2.000000  6.000000    b
3  2.666667  6.666667    b
--------------------
      col_1     col_2 name
0  0.000000  4.000000    a
1  0.666667  4.666667    a
2  2.000000  6.000000    b
3  2.666667  6.666667    b
--------------------
0    0.000000  # <<-------------------------------name should be an index?
1    0.666667
2    2.000000
3    2.666667
Name: col_1, dtype: float64
--------------------
0  0    0.000000  # <<-------------------------------multiindex now?
   1    0.666667
1  2    2.000000
   3    2.666667
Name: col_1, dtype: float64
+++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++++
   col_1  col_2
0    NaN    NaN
1    0.0    4.0
2    NaN    NaN
3    2.0    6.0
--------------------
   col_1  col_2
0    NaN    NaN
1    0.0    4.0
2    NaN    NaN
3    2.0    6.0
--------------------
0    NaN
1    0.0
2    NaN
3    2.0
Name: col_1, dtype: float64
--------------------
   col_1  # <<-------------------------------why is this a dataframe?
0    NaN
1    0.0
2    NaN
3    2.0
--------------------

pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.13.final.0
python-bits: 64
OS: Darwin
OS-release: 14.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: None.None

pandas: 0.20.3
pytest: 2.8.5
pip: 9.0.1
setuptools: 20.3
Cython: 0.23.4
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 4.1.2
sphinx: 1.3.5
patsy: 0.4.0
dateutil: 2.5.1
pytz: 2016.2
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.4
lxml: 3.6.0
bs4: 4.4.1
html5lib: None
sqlalchemy: 1.0.12
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
pandas_gbq: None
pandas_datareader: None

MrRobot2211 · 2019-10-16T14:46:00Z

I am having the exact same issue. Is tere any workaround?

vladu · 2019-10-28T21:14:20Z

Ran into the same issue today. There is a workaround. If you reset_index before the groupby, then assign still works correctly after the shift. So, if you're only trying to shift one column, you can do:

frm = frm.assign(shifted_val=frm.groupby('key').shift(1)['val'])

Or, if you're trying to shift the whole frame, you can assign the group col back:

shifted_frm = frm.groupby('key').shift(1)
shifted_frm = shifted_frm.assign(key=frm['key'])

vladu · 2019-10-28T21:18:42Z

Also, not sure if anything changed since the original report, but now in v0.25, the grouping cols disappear after a groupby + shift regardless of the value of as_index.

thoughtfuldata · 2021-03-16T00:00:09Z

has there been any progress on this? I'm getting the same issue as @vladu

jreback · 2021-03-16T00:11:12Z

@thoughtfuldata you or anyone else can solve this by contributing a pull request

vladu · 2021-03-16T19:43:31Z

I set up a pandas dev view on my machine this morning and poked around a bit. The underlying issue is actually much more wide-spread than just the group+shift. It appears that all transforms actually drop the grouping columns. You can in fact see this in the documentation:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.core.groupby.DataFrameGroupBy.transform.html

Note that in the examples section, the frame is grouped by column 'A', and that column appears nowhere in the output. I'll poke around some more, but worried this might have wider ranging consequences than I originally anticipated.

The transforms do preserve the original frame's index, so an easy workaround is to set_index first, then groupby the same columns. But that's inconsistent with agg operations, so I'll see if I can come up with a fix that isn't too extensive.

yanyu2015 · 2022-11-18T02:47:45Z

the grouping cols disappear after a groupby + shift regardless of the value of as_index=True.
Has there been any progress on this?

pandas           : 1.4.4

rhshadrach · 2023-07-15T11:20:33Z

This is behaving as expected. shift is a transformation and so as_index has no impact on its output. See the docs here:

https://pandas.pydata.org/docs/user_guide/groupby.html#transformation

A proposal to add functionality so that as_index impacts transforms is here: #49543

jreback mentioned this issue Jun 27, 2016

Consistency with groupby as_index #5755

Closed

8 tasks

jreback added Groupby API Design labels Jun 27, 2016

jreback added this to the Next Major Release milestone Jun 27, 2016

jreback added Difficulty Intermediate labels Jun 27, 2016

jbrockmendel removed Difficulty Intermediate labels Oct 21, 2019

WillAyd mentioned this issue Oct 28, 2019

[BUG] Fixed behavior of DataFrameGroupBy.apply to respect _group_selection_context #29131

Closed

5 tasks

mroeschke added Bug and removed API Design labels May 1, 2021

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022

rhshadrach closed this as completed Jul 15, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

groupby + shift drops group columns when as_index is False #13519

groupby + shift drops group columns when as_index is False #13519

tjader commented Jun 26, 2016

halfmoonhalf commented Sep 18, 2017 •

edited

Loading

MrRobot2211 commented Oct 16, 2019

vladu commented Oct 28, 2019

vladu commented Oct 28, 2019

thoughtfuldata commented Mar 16, 2021

jreback commented Mar 16, 2021

vladu commented Mar 16, 2021 •

edited

Loading

yanyu2015 commented Nov 18, 2022

rhshadrach commented Jul 15, 2023

groupby + shift drops group columns when as_index is False #13519

groupby + shift drops group columns when as_index is False #13519

Comments

tjader commented Jun 26, 2016

Code Sample, a copy-pastable example if possible

Expected Output

output of pd.show_versions()

halfmoonhalf commented Sep 18, 2017 • edited Loading

MrRobot2211 commented Oct 16, 2019

vladu commented Oct 28, 2019

vladu commented Oct 28, 2019

thoughtfuldata commented Mar 16, 2021

jreback commented Mar 16, 2021

vladu commented Mar 16, 2021 • edited Loading

yanyu2015 commented Nov 18, 2022

rhshadrach commented Jul 15, 2023

output of `pd.show_versions()`

halfmoonhalf commented Sep 18, 2017 •

edited

Loading

vladu commented Mar 16, 2021 •

edited

Loading