DataFrameGroupby.resample with the on keyword does not produce the same output as on DateTimeIndex #27343

Open
philippegr opened this issue Jul 11, 2019 · 5 comments
Labels
Datetime Datetime data dtype Docs Groupby Resample resample method

Comments

@philippegr

philippegr commented Jul 11, 2019

Code Sample, a copy-pastable example if possible

import datetime as dt

import pandas as pd

df = pd.DataFrame.from_records({'ref': ['a', 'a', 'a', 'b', 'b'],
                                'time': [dt.datetime(2014, 12, 31), dt.datetime(2015, 12, 31), dt.datetime(2016, 12, 31),
                                         dt.datetime(2012, 12, 31), dt.datetime(2014, 12, 31)],
                                'value': 5 * [1]})

# These two results differ
pd.DataFrame.equals(df.groupby("ref").resample(rule='M', on='time')['value'].sum(),
                    df.set_index('time').groupby("ref").resample(rule='M')['value'].sum())

# Like .set_index(), .apply produces the correct output
pd.DataFrame.equals(df.groupby("ref").apply(lambda f: f.resample(rule='M', on='time')['value'].sum()),
                    df.set_index('time').groupby("ref").resample(rule='M')['value'].sum())

# This is the incorrect frame
df.groupby("ref").resample(rule='5T', on='time')['value'].sum().to_frame()

Problem description

DataFrameGroupby.resample with the on keyword (last line) produces an incorrect output: it differs from the output produced with a DatetimeIndex and has incorrect values. It seems that it does not handle the groups properly:

  • the result has an entry for ('a', 2012-12-31 00:00:00): 1 that is incorrect: ref == 'a' only starts in 2014, so not only should there be no entry there, but the value of 1 is also wrong
  • the result has entries for 'b' starting in 2015, whereas ref == 'b' has its last time entry in 2014, so there should not be any entry after that, regardless of its value
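The two symptoms listed above can be turned into a small sanity check. This is only a sketch: `out_of_range_entries` is a made-up helper name, and `pd.offsets.MonthEnd(1)` is used in place of the 'M' alias on the assumption that the two are equivalent. It is applied here to the DatetimeIndex route that the report describes as correct.

```python
import datetime as dt

import pandas as pd

df = pd.DataFrame({
    "ref": ["a", "a", "a", "b", "b"],
    "time": [dt.datetime(2014, 12, 31), dt.datetime(2015, 12, 31),
             dt.datetime(2016, 12, 31), dt.datetime(2012, 12, 31),
             dt.datetime(2014, 12, 31)],
    "value": 5 * [1],
})


def out_of_range_entries(result, frame):
    """Hypothetical check: rows of `result` dated outside their group's span.

    `result` is expected to be a Series named "value" indexed by ("ref", "time").
    """
    # Earliest and latest original timestamp for each group.
    spans = frame.groupby("ref")["time"].agg(["min", "max"])
    # Attach each group's span to every resampled row of that group.
    flat = result.reset_index().join(spans, on="ref")
    outside = (flat["time"] < flat["min"]) | (flat["time"] > flat["max"])
    return flat.loc[outside, ["ref", "time", "value"]]


# Route that the report describes as correct: resample on a DatetimeIndex.
via_index = (
    df.set_index("time")
    .groupby("ref")
    .resample(pd.offsets.MonthEnd(1))["value"]
    .sum()
)
print(out_of_range_entries(via_index, df))
```

The check is only meaningful here because every timestamp in the example already falls on a month end, so a valid bin label cannot lie outside its group's span. Passing it the result of df.groupby("ref").resample(..., on='time') instead should show whether a given pandas version is affected.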

Expected Output

True for pd.DataFrame.equals(df.groupby("ref").resample(rule='M', on='time')['value'].sum(), df.set_index('time').groupby("ref").resample(rule='M')['value'].sum())
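For a reference result that does not depend on either code path, the expected output can be rebuilt with a plain loop over the groups. This is a minimal sketch, not how pandas does it internally: `resample_per_group` is a hypothetical helper, and `pd.offsets.MonthEnd(1)` again stands in for the 'M' alias on the assumption that they are equivalent.

```python
import datetime as dt

import pandas as pd

df = pd.DataFrame({
    "ref": ["a", "a", "a", "b", "b"],
    "time": [dt.datetime(2014, 12, 31), dt.datetime(2015, 12, 31),
             dt.datetime(2016, 12, 31), dt.datetime(2012, 12, 31),
             dt.datetime(2014, 12, 31)],
    "value": 5 * [1],
})


def resample_per_group(frame, rule):
    """Hypothetical helper: resample each 'ref' group on its own, then stack."""
    pieces = {
        key: group.resample(rule, on="time")["value"].sum()
        for key, group in frame.groupby("ref")
    }
    # The dict keys become the outer index level, named "ref".
    return pd.concat(pieces, names=["ref"])


reference = resample_per_group(df, pd.offsets.MonthEnd(1))

# First and last resampled timestamp per group.
spans = reference.reset_index().groupby("ref")["time"].agg(["min", "max"])
print(spans)
```

Because each group is resampled in isolation, the time range of every group in `reference` is bounded by that group's own data, which is the behaviour the comparison above expects from the on keyword.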

Output of pd.show_versions()


INSTALLED VERSIONS

commit: None
python: 3.7.3.final.0
python-bits: 64
OS: Darwin
OS-release: 18.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_CA.UTF-8
LOCALE: en_CA.UTF-8

pandas: 0.24.2
pytest: None
pip: 19.1.1
setuptools: 41.0.1
Cython: None
numpy: 1.16.4
scipy: None
pyarrow: None
xarray: None
IPython: 7.6.1
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2019.1
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10.1
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@jreback
Contributor

jreback commented Jul 11, 2019

we just updated the doc-string here: https://dev.pandas.io/reference/api/pandas.DataFrame.rolling.html?highlight=rolling#pandas.DataFrame.rolling

.resample could be the same

@philippegr
Author

philippegr commented Jul 11, 2019

I see the difference in .rolling()

Provided integer column is ignored and excluded from result since an integer index is not used to calculate the rolling window.

but I am not sure what it means here: I am grouping on an object (string) column.

@philippegr
Author

philippegr commented Jul 11, 2019

Added to the initial example that .groupby().apply(lambda : resample) works as expected

@gfyoung gfyoung added Compat pandas objects compatability with Numpy or Python functions Docs Datetime Datetime data dtype labels Jul 12, 2019
@philippegr
Author

philippegr commented Jul 12, 2019

I saw the issue has been labelled [Docs], but from the outside I still believe it is a bug.

Here is the initial frame. (Compared to my initial example, I changed the rule from 5T to 6M below so that the output fits here, but the problem remains the same.) It seems that the resample is not done on each group but on the whole frame, and the result is then incorrectly labelled with the group label.

ref  time        value
a    2014-12-31  1
a    2015-12-31  1
a    2016-12-31  1
b    2012-12-31  1
b    2014-12-31  1

The output from df.groupby("ref").apply(lambda f : f.resample(rule='6M', on='time')['value'].sum())
is as expected (time range 2014-2016 for a and 2012-2014 for b, all good):

ref  time      
a    2014-12-31    1
     2015-06-30    0
     2015-12-31    1
     2016-06-30    0
     2016-12-31    1
b    2012-12-31    1
     2013-06-30    0
     2013-12-31    0
     2014-06-30    0
     2014-12-31    1
Name: value, dtype: int64

But the output of df.groupby("ref").resample(rule='6M', on='time')['value'].sum() is not correct: the times for a range from 2012 to 2014 (both too early), and for b from 2015 to 2016 (both too late, since b only has data from 2012 to 2014).

ref  time      
a    2012-12-31    1
     2013-06-30    0
     2013-12-31    0
     2014-06-30    0
     2014-12-31    2
b    2015-12-31    1
     2016-06-30    0
     2016-12-31    1
Name: value, dtype: int64
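The suspicion that rows from both groups end up in the same bin can be probed without calling the grouped resample at all. The sketch below resamples the whole frame and compares it with the largest sum a single group can contribute at one timestamp. `pd.offsets.MonthEnd(6)` stands in for the '6M' alias, on the assumption that the two are equivalent, and the variable names are purely illustrative.

```python
import datetime as dt

import pandas as pd

df = pd.DataFrame({
    "ref": ["a", "a", "a", "b", "b"],
    "time": [dt.datetime(2014, 12, 31), dt.datetime(2015, 12, 31),
             dt.datetime(2016, 12, 31), dt.datetime(2012, 12, 31),
             dt.datetime(2014, 12, 31)],
    "value": 5 * [1],
})

# Resample the whole frame, ignoring "ref" entirely.
whole = df.resample(pd.offsets.MonthEnd(6), on="time")["value"].sum()

# Largest sum any single group can contribute at one timestamp.
per_group_max = df.groupby(["ref", "time"])["value"].sum().max()

print(whole.loc[pd.Timestamp("2014-12-31")])
print(per_group_max)
```

When the whole-frame value at a timestamp is larger than anything a single group can contribute, a per-group bin holding that value must contain rows from more than one group, which is consistent with the 2 reported for ('a', 2014-12-31) above.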

@mroeschke mroeschke added Groupby Resample resample method and removed Compat pandas objects compatability with Numpy or Python functions labels Apr 2, 2020
@valkum

valkum commented Jul 21, 2020

I think I ended up with the same bug. You can find an additional example here: https://repl.it/@valkum/SoulfulFrequentBooleanlogic#main.py
I am not sure if this is related to #35275

5 participants