BUG: wrong df.groupby().groups when grouping with [Grouper(freq=), ...] #33132

Closed
falcaopetri opened this issue Mar 29, 2020 · 6 comments

falcaopetri commented Mar 29, 2020

Code

import pandas as pd
from datetime import datetime
mi = pd.MultiIndex.from_product([pd.date_range(datetime.today(), periods=2),
                                    ["C", "D"]], names=["alpha", "beta"])
df = pd.DataFrame({"foo": [1, 2, 1, 2], "bar": [1, 2, 3, 4]}, index=mi)
result = df.groupby([pd.Grouper(level="alpha", freq='D'), "beta"])

print(len(result), result.ngroups)
# 2 4
print(result.groups)
# {(Timestamp('2020-04-05 00:00:00'), 'C'): MultiIndex([('2020-04-05 15:04:51.580573', 'C')], names=['alpha', 'beta']), 
#  (Timestamp('2020-04-06 00:00:00'), 'D'): MultiIndex([('2020-04-05 15:04:51.580573', 'D')], names=['alpha', 'beta'])}

Problem description

This issue is an extension of the bug reported in #26326. PR #26374 resolved the bug for the case where the nested grouper is a BaseGrouper. However, a nested BinGrouper (which is what Grouper(freq=...) produces) still results in wrong behavior, as the code above shows.

Note that len(result) is derived from len(result.groups), and that result.groups should instead return the following:

# {(Timestamp('2020-04-05 00:00:00'), 'C'): MultiIndex([('2020-04-05 15:04:51.580573', 'C')], names=['alpha', 'beta']), 
#  (Timestamp('2020-04-05 00:00:00'), 'D'): MultiIndex([('2020-04-05 15:04:51.580573', 'D')], names=['alpha', 'beta']), 
#  (Timestamp('2020-04-06 00:00:00'), 'C'): MultiIndex([('2020-04-06 15:04:51.580573', 'C')], names=['alpha', 'beta']),
#  (Timestamp('2020-04-06 00:00:00'), 'D'): MultiIndex([('2020-04-06 15:04:51.580573', 'D')], names=['alpha', 'beta'])}
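
As a quick way to see the inconsistency in one place (a minimal sketch that reuses result from the snippet above; the assertions simply restate the "2 4" printout):

# result.ngroups reflects all 4 (daily bin, beta) combinations ...
assert result.ngroups == 4
# ... but result.groups (and therefore len(result)) only exposes 2 of them.
assert len(result) == len(result.groups) == 2
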
INSTALLED VERSIONS
------------------
commit           : 7673357191709036faad361cbb5f31a802703249
python           : 3.7.6.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.5.8-1-MANJARO
Version          : #1 SMP PREEMPT Thu Mar 5 20:29:51 UTC 2020
machine          : x86_64
processor        : 
byteorder        : little
LC_ALL           : C.UTF-8
LANG             : C.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.1.0.dev0+1027.g767335719.dirty
numpy            : 1.18.1
pytz             : 2019.3
dateutil         : 2.8.1
pip              : 20.0.2
setuptools       : 45.2.0.post20200210
Cython           : 0.29.15
pytest           : 5.4.1
hypothesis       : 5.6.0
sphinx           : 2.4.4
blosc            : None
feather          : None
xlsxwriter       : 1.2.8
lxml.etree       : 4.4.1
html5lib         : 1.0.1
pymysql          : None
psycopg2         : None
jinja2           : 2.11.1
IPython          : 7.13.0
pandas_datareader: None
bs4              : 4.8.2
bottleneck       : 1.3.2
fastparquet      : 0.3.3
gcsfs            : None
matplotlib       : 3.1.3
numexpr          : 2.7.1
odfpy            : None
openpyxl         : 3.0.1
pandas_gbq       : None
pyarrow          : 0.16.0
pytables         : None
pyxlsb           : None
s3fs             : 0.4.0
scipy            : 1.4.1
sqlalchemy       : 1.3.15
tables           : 3.6.1
tabulate         : 0.8.6
xarray           : 0.15.0
xlrd             : 1.2.0
xlwt             : 1.3.0
numba            : 0.48.0
@falcaopetri (Author)

Using the previous example code, we have:

result.grouper
# <pandas.core.groupby.ops.BaseGrouper at 0x7fbdc11255c0>
result.grouper.groupings
# [Grouping(alpha), Grouping(beta)]
result.grouper.groupings[0].grouper
# <pandas.core.groupby.ops.BinGrouper at 0x7fbdd05e0710>
result.grouper.groupings[0].grouper.groupings[0].grouper
# DatetimeIndex(['2020-03-29', '2020-03-30'], dtype='datetime64[ns]', name='alpha', freq=None)
result.grouper.groupings[1].grouper
# Index(['C', 'D', 'C', 'D'], dtype='object', name='beta')

As discussed in #26326, the issue is in

to_groupby = zip(*(ping.grouper for ping in self.groupings))

This zips the iteration over the BinGrouper
([Timestamp('2020-03-29 00:00:00', freq='D'), Timestamp('2020-03-30 00:00:00', freq='D')], i.e. only 2 bin labels)
with result.grouper.groupings[1].grouper
(Index(['C', 'D', 'C', 'D'], dtype='object', name='beta'), i.e. 4 per-row labels),
so we end up with only
[(Timestamp('2020-03-29 00:00:00', freq='D'), 'C'), (Timestamp('2020-03-30 00:00:00', freq='D'), 'D')].
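
A standalone sketch of that truncation (illustrative only, with the values copied from the output above; this is not the actual pandas internals):

import pandas as pd

bin_labels = [pd.Timestamp('2020-03-29'), pd.Timestamp('2020-03-30')]  # iterating the BinGrouper yields 2 bin labels
row_labels = pd.Index(['C', 'D', 'C', 'D'], name='beta')               # the second grouping has 4 per-row labels

# zip stops at the shorter iterable, so only 2 of the 4 expected group keys survive:
print(list(zip(bin_labels, row_labels)))
# [(Timestamp('2020-03-29 00:00:00'), 'C'), (Timestamp('2020-03-30 00:00:00'), 'D')]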

@falcaopetri (Author)

I've tried to fix this in the linked PR, but it breaks too many things. The basic idea was to make BinGrouper.groupings aware of the correct index it should return:

def groupings(self) -> "List[grouper.Grouping]":
    return [
        grouper.Grouping(lvl, lvl, in_axis=False, level=None, name=name)
        for lvl, name in zip(self.levels, self.names)
    ]

Any ideas for a better approach?
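
For context, here is a standalone sketch of the element-wise pairing a fix ultimately needs to produce: the first grouping has to expose one bin label per row (4 entries) rather than one label per bin (2 entries), so that pairing it with the second grouping yields all 4 keys. Plain pandas only, no internals; the dates are illustrative:

import pandas as pd

mi = pd.MultiIndex.from_product([pd.date_range("2020-03-29 15:00", periods=2),
                                 ["C", "D"]], names=["alpha", "beta"])

# One daily bin label per row (4 entries), instead of one label per bin (2 entries):
per_row_bins = mi.get_level_values("alpha").floor("D")
per_row_betas = mi.get_level_values("beta")

print(list(zip(per_row_bins, per_row_betas)))
# 4 (bin, beta) pairs -> 4 distinct group keys, matching the expected result.groups above
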

jreback added the Bug label Mar 30, 2020
jreback added this to the 1.1 milestone Mar 30, 2020
falcaopetri changed the title from "BUG: ngroups and len(groups) do not equal when grouping with a list of Grouper(freq=) and column label" to "BUG: wrong df.groupby().groups when grouping with [Grouper(freq=), ...]" Apr 5, 2020
falcaopetri added a commit to falcaopetri/pandas that referenced this issue Apr 5, 2020
jreback (Contributor) commented Jun 14, 2020

It's possible this is resolved on master, as resample (which is what this ultimately calls) has been updated a bit. Please re-test on master.

@TomAugspurger (Contributor)

Moving off 1.1, but there's an open PR, so we can add it back if that PR progresses.

TomAugspurger removed this from the 1.1 milestone Jul 7, 2020

abudis commented Nov 9, 2020

I can confirm that this bug exists on 1.1.3. It's a nasty one, because I was doing something like:

df.groupby(['someid', pd.Grouper(key='somedate', freq='30D')], sort=False)['somevalue'].mean().groupby(['someid']).mean()

which doesn't break, but instead produces incorrect mean values.
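
For anyone stuck on an affected version, one possible sanity check before trusting downstream aggregations is to compare the two counts that disagree in the original report. This is a hedged sketch reusing the illustrative df and column names from the snippet above; it only detects the inconsistency, it does not fix it:

g = df.groupby(['someid', pd.Grouper(key='somedate', freq='30D')], sort=False)
# On affected versions these two counts can disagree (compare the "2 4" output in the original post):
if g.ngroups != len(g.groups):
    raise RuntimeError("inconsistent groupby metadata, see pandas issue #33132")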

@rhshadrach (Member)

I believe this is a duplicate of #51158. I've confirmed that the OP's example is now fixed on main.
