Skip to content

Shifting GroupBy Object with freq inserts unwanted index level #23918

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
WillAyd opened this issue Nov 26, 2018 · 2 comments
Open

Shifting GroupBy Object with freq inserts unwanted index level #23918

WillAyd opened this issue Nov 26, 2018 · 2 comments
Labels
Bug Datetime Datetime data dtype Frequency DateOffsets Groupby

Comments

@WillAyd
Copy link
Member

WillAyd commented Nov 26, 2018

IMO somewhat unexpected behavior depending on the arguments provided to shift after a groupby:

In [26]: ser = pd.Series(range(3), index=pd.date_range('1/1/18', periods=3))  

In [28]: ser.groupby([1,1,1]).shift()                                                                                                      
Out[28]: 
2018-01-01    NaN
2018-01-02    0.0
2018-01-03    1.0
Freq: D, dtype: float64

# No-op with freq keyword
In [34]: ser.groupby([1,1,1]).shift(0, freq='D')                                                                                           
Out[34]: 
2018-01-01    0
2018-01-02    1
2018-01-03    2
Freq: D, Name: 1, dtype: int64

# Suddenly inserted the grouper into the first level of a MI?
In [30]: ser.groupby([1,1,1]).shift(freq='D')                                                                                              
Out[30]: 
1  2018-01-02    0
   2018-01-03    1
   2018-01-04    2
Name: 1, dtype: int64

This is arguably in contrast to #22053 which is asking for the behavior of the last item consistently, but I think that is the one that is actually incorrect out of all examples and causes misalignment with the index of the original caller.

cc @SimonAlecks who originally found this in #21235

Expected Output

In [30]: ser.groupby([1,1,1]).shift(freq='D')                                                                                              
Out[34]: 
2018-01-01    NaN
2018-01-02    0.0
2018-01-03    1.0
Freq: D, dtype: float64

Output of pd.show_versions()

INSTALLED VERSIONS

commit: 1f1f705
python: 3.6.7.final.0
python-bits: 64
OS: Darwin
OS-release: 18.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.24.0.dev0+1123.g1f1f7053d.dirty
pytest: 4.0.0
pip: 18.1
setuptools: 40.6.2
Cython: 0.29
numpy: 1.15.4
scipy: 1.1.0
pyarrow: 0.11.1
xarray: 0.11.0
IPython: 7.1.1
sphinx: 1.8.2
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.1
openpyxl: 2.5.9
xlrd: 1.1.0
xlwt: 1.2.0
xlsxwriter: 1.1.2
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.14
pymysql: 0.9.2
psycopg2: None
jinja2: 2.10
s3fs: 0.1.6
fastparquet: 0.1.6
pandas_gbq: None
pandas_datareader: None
gcsfs: 0.2.0

@simonariddell
Copy link
Contributor

simonariddell commented Dec 14, 2018

So @WillAyd here is what I can tell so far,

def _python_apply_general(self, f):
keys, values, mutated = self.grouper.apply(f, self._selected_obj,
self.axis)

In this method mutated returns as True in the case where the index is shifted, which then passes into line 712 as not_indexed_same. In this case keys is the same value for both the case where it has expected and unexpected benefits

keys
>Int64Index([1], dtype='int64')

One question is, is this expected behavior here?

This then passes here:

if not not_indexed_same:
result = concat(values, axis=self.axis)
ax = self._selected_obj._get_axis(self.axis)

If this block is hit, it's good.

However, the problem we see is here:

elif self.group_keys:
values = reset_identity(values)
if self.as_index:
# possible MI return case
group_keys = keys
group_levels = self.grouper.levels
group_names = self.grouper.names
result = concat(values, axis=self.axis, keys=group_keys,
levels=group_levels, names=group_names,
sort=False)

If we debug into it we can see what is happening

group_keys
> Int64Index([1], dtype='int64')

So this seems to be working as expected, it's just being given group_keys that don't really make sense.

In fact, we can generalize this problem a little by noticing it's 'unrelated' to the shift function so to speak.

gb = ser.groupby([1, 1, 1])
# gb.shift(freq='D')

def _mutate(x):
    return x.reindex(x.index[0:2])
gb.apply(_mutate)
> 1  2018-01-01    0
>   2018-01-02    1
> Name: 1, dtype: int64

Essentially any groupby operation that mutates the dataframe induces this strange index behavior.

However, what is interesting is that in all cases the group_keys are wrong (I'm assuming Int64Index([1], dtype='int64') is wrong). However, it's only the cases where the data is also mutated that it hits the conditional logic on groupby.py 906 (as included above), where this wrong key gives us unexpected output.

This originates from the call in the apply method in ops.py, specifically here:

def _get_group_keys(self):
if len(self.groupings) == 1:
return self.levels[0]

Where self.levels[0] returns Int64Index([1], dtype='int64'.

Anyway, I probably need to read more unit tests to understand the expected behavior and when and why having conditional logic for mutated indices is necessary, and when and why the group_keys works as intended. I'll continue looking into this and add updates.

def apply(self, f, data, axis=0):
mutated = self.mutated
splitter = self._get_splitter(data, axis=axis)
group_keys = self._get_group_keys()

@mroeschke mroeschke added Bug Frequency DateOffsets labels Apr 1, 2020
@jbrockmendel
Copy link
Member

@WillAyd on main i'm seeing a result that looks reasonable but also doesn't match your Expected. can you take a look

There's also a test_pct_change with an xfail that points back to this issue that looks like the current (raising) behavior is correct. any idea whats going on there?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Datetime Datetime data dtype Frequency DateOffsets Groupby
Projects
None yet
Development

No branches or pull requests

4 participants