Skip to content

Last day of month should group with that month #17898

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
louispotok opened this issue Oct 16, 2017 · 7 comments
Open

Last day of month should group with that month #17898

louispotok opened this issue Oct 16, 2017 · 7 comments
Labels
Bug Resample resample method

Comments

@louispotok
Copy link
Contributor

louispotok commented Oct 16, 2017

Code Sample, a copy-pastable example if possible

import pandas as pd
dts = pd.date_range(start='2016-09-01',end='2017-08-31')
df = pd.DataFrame(
    {
        'day': dts, 
        "val": np.random.rand(len(dts)), 
    }
)
right_cls = df.groupby(pd.Grouper(key='day',freq='4M', closed="right")).size()
left_cls = df.groupby(pd.Grouper(key='day',freq='4M',closed="left")).size()

print(right_cls.index.max())
# pd.Timestamp('2017-09-30 00:00:00')
print(left_cls.index.max())
# pd.Timestamp('2017-12-31 00:00:00')

Problem description

I don't know whether this is a bug or whether I don't understand the expected behavior. I would expect the max index value in both cases to be "2017-08-31", but it looks like it's considering 08-31-2017 to be in September. If I have an observation on 2017-08-31 before midnight, it should still be grouped into the period ending 2017-08-31, right?

Expected Output

# This should work
assert right_cls.index.max() == pd.Timestamp('2017-08-31 00:00:00')
assert left_cls.index.max() == pd.Timestamp('2017-08-31 00:00:00')

Output of pd.show_versions()

I'm using pandas 0.20.3 here, but I also checked this on the latest commit and it looks like the behavior persists.

INSTALLED VERSIONS ------------------ commit: None python: 3.6.2.final.0 python-bits: 64 OS: Linux OS-release: 4.10.0-37-generic machine: x86_64 processor: x86_64 byteorder: little LC_ALL: None LANG: en_US.UTF-8 LOCALE: en_US.UTF-8

pandas: 0.20.3
pytest: None
pip: 9.0.1
setuptools: 36.4.0
Cython: None
numpy: 1.13.1
scipy: 0.19.1
xarray: None
IPython: 6.1.0
sphinx: None
patsy: 0.4.1
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: 1.1.13
pymysql: None
psycopg2: 2.7.1 (dt dec pq3 ext lo64)
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None

@jreback
Copy link
Contributor

jreback commented Oct 17, 2017

looks like a fencepost problem. if you'd like to debug would be great

for M

In [21]: import pandas as pd
    ...: dts = pd.date_range(start='2016-09-01',end='2017-08-31')
    ...: df = pd.DataFrame(
    ...:     {
    ...:         'day': dts, 
    ...:         "val": 1
    ...:     }
    ...: )
    ...: 
    ...: 

In [22]:  df.groupby(pd.Grouper(key='day',freq='4M', closed='left')).size()
Out[22]: 
day
2016-12-31    121
2017-04-30    120
2017-08-31    123
2017-12-31      1
Freq: 4M, dtype: int64

In [24]:  df.groupby(pd.Grouper(key='day',freq='4M', closed='right')).size()
Out[24]: 
day
2016-09-30     30
2017-01-31    123
2017-05-31    120
2017-09-30     92
Freq: 4M, dtype: int64

for MS

In [29]:  df.groupby(pd.Grouper(key='day',freq='4MS', closed='right')).size()
Out[29]: 
day
2016-05-01      1
2016-09-01    122
2017-01-01    120
2017-05-01    122
Freq: 4MS, dtype: int64

In [30]:  df.groupby(pd.Grouper(key='day',freq='4MS', closed='left')).size()
Out[30]: 
day
2016-09-01    122
2017-01-01    120
2017-05-01    123
Freq: 4MS, dtype: int64

@jreback jreback added this to the Next Major Release milestone Oct 17, 2017
@louispotok
Copy link
Contributor Author

Looking into
df.groupby(pd.Grouper(key='day',freq='4M', closed='right')).size() and putting notes here in case they are helpful:

It looks like here we do:

    if closed == 'left':
        first = Timestamp(offset.rollback(first))
    else:
        first = Timestamp(first - offset)

    last = Timestamp(last + offset)

So in the closed='right' case, we:

  • change first: 9/1 -> 5/31
  • change last: 8/31 -> 12/31

first and last are then used to create a DateTime Index which is the basis for the return index here:
DatetimeIndex(['2016-05-31', '2016-09-30', '2017-01-31', '2017-05-31', '2017-09-30'] , dtype='datetime64[ns]', name='day', freq='4M')
Finally we cut off the first value in that index because we are closed right.

In the debugger, if I manually overwrite that dt index with pd.DatetimeIndex(freq=self.freq,end=last, periods=5) the rest of the code seems to do the right thing:

2016-12-31    122
2017-04-30    120
2017-08-31    123

Will look into this more a little later.

@heydenberk
Copy link

Is this the same core issue as #13022—and hence, addressed by #22488 and due to land with 0.24?

@louispotok
Copy link
Contributor Author

@heydenberk just retried my sample on master and the output is the same, so looks like it hasn't been fixed.

@pbogaard
Copy link

Any updates on this?

@louispotok
Copy link
Contributor Author

I haven't looked into this any further, but just retried the test case on 1.2.1 and the same issue persists.

@MarcoGorelli
Copy link
Member

MarcoGorelli commented Feb 5, 2021

This is probably just me not understanding, but I don't see what's wrong with the current output

For a smaller example, if we have

In [45]: pd.DataFrame({'day': pd.date_range(start='2016-09-01', end='2016-09-30'
    ...: )}).groupby(pd.Grouper(key='day', freq='M', closed='left')).size()
Out[45]: 
day
2016-09-30    29
2016-10-31     1
Freq: M, dtype: int64

then do you think the output should be

In [45]: pd.DataFrame({'day': pd.date_range(start='2016-09-01', end='2016-09-30'
    ...: )}).groupby(pd.Grouper(key='day', freq='M', closed='left')).size()
Out[45]: 
day
2016-09-30    30
Freq: M, dtype: int64

?

Because if you want the interval to be closed on the left, then it seems correct to me that 2016-09-30 would not appear in the interval [2016-08-31, 2016-09-30)


In the given example, if we start with 2016-09-01 and go forwards in 4-month intervals closed on the left, then we get the intervals:

[2016-05-31, 2016-09-30)
[2016-09-30, 2017-01-31)
[2017-01-31, 2017-05-31)
[2017-05-31, 2017-09-30)

and it seems correct to me that 2017-08-31 would go in the interval [2017-05-31, 2017-09-30)

@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Resample resample method
Projects
None yet
Development

No branches or pull requests

7 participants