DataFrameGroupby.resample with the `on` keyword does not produce the same output as on DateTimeIndex #27343

philippegr · 2019-07-11T18:18:58Z

Code Sample, a copy-pastable example if possible

import datetime as dt
import pandas as pd

df = pd.DataFrame.from_records({'ref': ['a', 'a', 'a', 'b', 'b'],
                                'time': [dt.datetime(2014, 12, 31), dt.datetime(2015, 12, 31), dt.datetime(2016, 12, 31),
                                         dt.datetime(2012, 12, 31), dt.datetime(2014, 12, 31)],
                                'value': 5 * [1]})

# These two results differ (equals returns False)
pd.DataFrame.equals(df.groupby("ref").resample(rule='M', on='time')['value'].sum(),
                    df.set_index('time').groupby("ref").resample(rule='M')['value'].sum())

# As with .set_index(), .apply produces the correct output (equals returns True)
pd.DataFrame.equals(df.groupby("ref").apply(lambda f: f.resample(rule='M', on='time')['value'].sum()),
                    df.set_index('time').groupby("ref").resample(rule='M')['value'].sum())

# This is the incorrect result
df.groupby("ref").resample(rule='5T', on='time')['value'].sum().to_frame()

Problem description

DataFrameGroupby.resample with the on keyword (last line) produces an incorrect output: it differs from the output produced with a DatetimeIndex and contains incorrect values. It seems that it does not handle the groups properly:

the result has an entry ('a', 2012-12-31 00:00:00): 1 that is incorrect: ref == 'a' only starts in 2014, so there should be no entry at all, let alone one with a value of 1
the result has entries for 'b' starting in 2015, whereas the latest time for ref == 'b' is in 2014, so there should be no entry after that, regardless of its value

Expected Output

True for pd.DataFrame.equals(df.groupby("ref").resample(rule='M', on='time')['value'].sum(), df.set_index('time').groupby("ref").resample(rule='M')['value'].sum())
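
One way to get a baseline for this comparison that does not rely on either groupby path is to resample each group on its own and concatenate the pieces. The sketch below is only illustrative: `resample_per_group` is a hypothetical helper, not a pandas function, and `pd.offsets.MonthEnd()` is used in place of `rule='M'` as an assumption, so that the example does not depend on how a given pandas version spells the month-end alias.

```python
import pandas as pd

df = pd.DataFrame({
    "ref": ["a", "a", "a", "b", "b"],
    "time": pd.to_datetime(["2014-12-31", "2015-12-31", "2016-12-31",
                            "2012-12-31", "2014-12-31"]),
    "value": [1] * 5,
})

# Assumption: a DateOffset object stands in for rule='M' so the sketch does
# not depend on the month-end alias spelling of the installed pandas version.
rule = pd.offsets.MonthEnd()


def resample_per_group(frame, key, on, rule, column):
    """Hypothetical helper: resample each group separately, then concatenate."""
    pieces = {
        label: group.resample(rule, on=on)[column].sum()
        for label, group in frame.groupby(key)
    }
    return pd.concat(pieces, names=[key, on])


baseline = resample_per_group(df, "ref", "time", rule, "value")

# First and last time label per group; each should match that group's own data.
spans = baseline.index.to_frame(index=False).groupby("ref")["time"].agg(["min", "max"])
print(spans)
```

The result of `df.groupby("ref").resample(rule, on="time")["value"].sum()` can then be compared against `baseline`; whether the two agree depends on the installed pandas version.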

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.7.3.final.0
python-bits: 64
OS: Darwin
OS-release: 18.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_CA.UTF-8
LOCALE: en_CA.UTF-8

pandas: 0.24.2
pytest: None
pip: 19.1.1
setuptools: 41.0.1
Cython: None
numpy: 1.16.4
scipy: None
pyarrow: None
xarray: None
IPython: 7.6.1
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

jreback · 2019-07-11T18:22:02Z

We just updated the docstring here: https://dev.pandas.io/reference/api/pandas.DataFrame.rolling.html?highlight=rolling#pandas.DataFrame.rolling

The .resample docstring could be updated the same way.

philippegr · 2019-07-11T18:30:19Z

I see the difference in the .rolling() docstring:

"Provided integer column is ignored and excluded from result since an integer index is not used to calculate the rolling window."

But I am not sure what it means here: I am grouping on an object (string) column.

philippegr · 2019-07-11T18:37:11Z

Added to the initial example that .groupby().apply(lambda : resample) works as expected

philippegr · 2019-07-12T13:28:44Z

I saw that the issue has been labelled [Docs], but from the outside I still believe it is a bug.

Below is the initial frame. (Compared to my initial example, I changed the rule from 5T to 6M so that the output fits here, but the problem remains the same.) It seems that the resample is not done per group but on the whole frame, and the result is then incorrectly labelled with the group labels.

ref	time	value
a	2014-12-31	1
a	2015-12-31	1
a	2016-12-31	1
b	2012-12-31	1
b	2014-12-31	1

The output of df.groupby("ref").apply(lambda f: f.resample(rule='6M', on='time')['value'].sum()) is as expected (time range 2014-2016 for a and 2012-2014 for b):

ref  time      
a    2014-12-31    1
     2015-06-30    0
     2015-12-31    1
     2016-06-30    0
     2016-12-31    1
b    2012-12-31    1
     2013-06-30    0
     2013-12-31    0
     2014-06-30    0
     2014-12-31    1
Name: value, dtype: int64

But the output of df.groupby("ref").resample(rule='6M', on='time')['value'].sum() is not correct: the times for a range from 2012 to 2014 (both too early) and those for b from 2015 to 2016 (both too late):

ref  time      
a    2012-12-31    1
     2013-06-30    0
     2013-12-31    0
     2014-06-30    0
     2014-12-31    2
b    2015-12-31    1
     2016-06-30    0
     2016-12-31    1
Name: value, dtype: int64
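
The mismatch between the two outputs above can be checked mechanically by comparing each group's resampled time labels with the time span actually observed for that group. The following is a minimal sketch: `groups_out_of_range` and `make_series` are hypothetical helpers, and the two Series are typed in by hand from the outputs shown above rather than recomputed. The check assumes, as in this example, that the observations fall exactly on bin labels; for other data a tolerance of one bin width would be needed.

```python
import pandas as pd

df = pd.DataFrame({
    "ref": ["a", "a", "a", "b", "b"],
    "time": pd.to_datetime(["2014-12-31", "2015-12-31", "2016-12-31",
                            "2012-12-31", "2014-12-31"]),
    "value": [1] * 5,
})


def make_series(rows):
    """Hypothetical helper: build a (ref, time)-indexed Series from tuples."""
    index = pd.MultiIndex.from_tuples(
        [(ref, pd.Timestamp(time)) for ref, time, _ in rows], names=["ref", "time"]
    )
    return pd.Series([value for _, _, value in rows], index=index, name="value")


def groups_out_of_range(result, frame, key="ref", on="time"):
    """Return the groups whose time labels leave the group's observed span."""
    observed = frame.groupby(key)[on].agg(["min", "max"])
    bad = []
    for label, times in result.index.to_frame(index=False).groupby(key)[on]:
        if times.min() < observed.loc[label, "min"] or times.max() > observed.loc[label, "max"]:
            bad.append(label)
    return bad


# Output reported for .groupby("ref").apply(... resample(on='time') ...)
via_apply = make_series([
    ("a", "2014-12-31", 1), ("a", "2015-06-30", 0), ("a", "2015-12-31", 1),
    ("a", "2016-06-30", 0), ("a", "2016-12-31", 1),
    ("b", "2012-12-31", 1), ("b", "2013-06-30", 0), ("b", "2013-12-31", 0),
    ("b", "2014-06-30", 0), ("b", "2014-12-31", 1),
])

# Output reported for .groupby("ref").resample(rule='6M', on='time')
via_on = make_series([
    ("a", "2012-12-31", 1), ("a", "2013-06-30", 0), ("a", "2013-12-31", 0),
    ("a", "2014-06-30", 0), ("a", "2014-12-31", 2),
    ("b", "2015-12-31", 1), ("b", "2016-06-30", 0), ("b", "2016-12-31", 1),
])

print(groups_out_of_range(via_apply, df))  # no group flagged
print(groups_out_of_range(via_on, df))     # both groups flagged
```

The same helper can be pointed at a freshly computed result instead of the hand-typed Series to see how a particular pandas version behaves.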

valkum · 2020-07-21T12:53:45Z

I think I ended up with the same bug. You can find an additional example here: https://repl.it/@valkum/SoulfulFrequentBooleanlogic#main.py
I am not sure if this is related to #35275

gfyoung added the Compat, Docs, and Datetime labels on Jul 12, 2019

mroeschke added the Groupby and Resample labels and removed the Compat label on Apr 2, 2020

valkum mentioned this issue on Jul 21, 2020 in: BUG: agg on groups with different sizes fails with out of bounds IndexError #35275 (Open)

phofl mentioned this issue on Sep 7, 2020 in: [BUG]: Groupy and Resample miscalculated aggregation #36198 (Closed)

jreback mentioned this issue on Nov 19, 2020 in: API/BUG: DatetimeIndex.argsort does not match DatetimeArray.argsort #37863 (Closed)
