Skip to content

resampling closed='left' incorrect ? #5440

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Cd48 opened this issue Nov 5, 2013 · 8 comments
Closed

resampling closed='left' incorrect ? #5440

Cd48 opened this issue Nov 5, 2013 · 8 comments
Labels
Bug Resample resample method

Comments

@Cd48
Copy link

Cd48 commented Nov 5, 2013

Hi,

Coming back to issue #2665, I'm still not convinced that the closed-argument of the resampling function does what it should.

When resampling daily to monthly values with closed = 'left', the closed argument should relate to the actual time the months change, but instead it seems to refer to the time 24 hours earlier.

In [25]: import pandas as pd
    ...: import numpy as np
    ...: dates = pd.date_range('3/29/2000', '4/3/2000', freq='1D')

In [27]: ts1 = pd.Series(np.ones(len(dates)), index=dates)

In [28]: ts2 = ts1.resample('1M', closed='left',label='left').sum()

In [29]: ts3 = ts1.resample('1M').sum()

In [30]: ts1
Out[30]:
2000-03-29    1.0
2000-03-30    1.0
2000-03-31    1.0
2000-04-01    1.0
2000-04-02    1.0
2000-04-03    1.0
Freq: D, dtype: float64

# resampling and asking that the first sample of the month (midnight April 1st) is part of that month does not result in what you think. Instead pandas does not consider 01/04/2000 00:00 as the time the months change, but it thinks the months change at 31/03/2000 00:00 (one day too early)
In [31]: ts2
Out[31]:
2000-02-29    2.0
2000-03-31    4.0
Freq: M, dtype: float64


# keeping it simple is a possible solution but it would be better to have the closed-argument working correctly ...

In [32]: ts3
Out[32]:
2000-03-31    3.0
2000-04-30    3.0
Freq: M, dtype: float64
@jtratner
Copy link
Contributor

jtratner commented Nov 5, 2013

Can you post the output you expect compared to what you actually get? Makes
it easier to follow and also to write better test cases to make sure the
problem is resolved.

@cancan101
Copy link
Contributor

Might be worth looking at this issue as well: #5023

@kdebrab
Copy link
Contributor

kdebrab commented Nov 24, 2013

I think the expected behavior is:

>>> ts1.resample('1M', how='sum', closed='left',label='right')
2000-03-31    3
2000-04-30    3
Freq: M, dtype: float64
>>> ts1.resample('1M', how='sum', closed='right',label='right')
2000-03-31    4
2000-04-30    2
Freq: M, dtype: float64

Of course, in that case the default behavior should become closed='left'.

See also my comment on issue #2665.

@jreback jreback modified the milestones: 0.15.0, 0.14.0 Apr 5, 2014
@jreback jreback modified the milestones: 0.16.0, Next Major Release Mar 3, 2015
@mroeschke mroeschke added the Resample resample method label Dec 24, 2019
@mroeschke mroeschke added the Bug label Mar 31, 2020
@mroeschke mroeschke removed Period Period data type Datetime Datetime data dtype labels Apr 11, 2021
@mroeschke mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022
@MarcoGorelli
Copy link
Member

MarcoGorelli commented Feb 28, 2023

This looks correct to me

You have

In [5]: ts1
Out[5]: 
2000-03-29    1.0
2000-03-30    1.0
2000-03-31    1.0
2000-04-01    1.0
2000-04-02    1.0
2000-04-03    1.0

If you do resample('1M', closed='left', label='left'), then you end up with the following groups:

  • 2000-02-29: [2000-02-29, 2000-03-31)
  • 2000-03-31: [2000-03-31, 2000-04-30)

If you look at the values in each group you get:

  • [1, 1]
  • [1, 1, 1, 1]
    so, after the sum aggregation, you get:
- 2000-02-29: 2.0
- 2000-03-31: 4.0

As documented here, 'M' means "month end". So the choice of bins looks correct as well.

I think the OP's phrase

Instead pandas does not consider 01/04/2000 00:00 as the time the months change

suggests they misunderstood what 'M' does, and that perhaps they meant 'MS', which gives

In [9]: ts1.resample('1MS', closed='left', label='left').sum()
Out[9]: 
2000-03-01    3.0
2000-04-01    3.0
Freq: MS, dtype: float64

Closing then - please do let me know if I've misunderstood and I'll reopen

@jorisvandenbossche
Copy link
Member

I think you are technically speaking right that this is "correct", but it is also very confusing ..

I suppose the main confusing part is that the month end/start labels are expressed in full days, and thus this is the start of the last day of the month. So when using closed="left", also the hours in the middle of the last day of a previous month are considered part of the next month, and not only the exact separating instant between both months (i.e. 2000-03-31 23:59:59.999.. / 2000-04-01 00:00:00).

This contrasts to how for example the Day offset works, where the division between both periods is the exact instant.
I suppose for months this was done to have it play nicer with daily frequencies and because we want to show the month value in the resulting DatetimeIndex as a date (i.e. "2000-03-31" and not "2000-03-31 23.59.59.999" as the month end value)

I think this is also basically what Stephan explains in #9586

Given the way that the MonthEnd frequency is defined as the start of the last day, I doubt that it ever makes sense to use this with closed="left". To avoid this confusion, maybe we could raise a warning/error when the user does this, pointing them to use freq="MS" (MonthStart) instead of MonthEnd?

@MarcoGorelli
Copy link
Member

I doubt that it ever makes sense to use this with closed="left"

Yes, I was thinking that - agree on raising (or at least warning), I'll open a new issue

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Mar 1, 2023

I think we can use #9586 for that, since it's basically closing the confusion that Stephan describes there?

@MarcoGorelli
Copy link
Member

such old issues tend to not get any attention/activity, and the example doesn't run anymore as it's Python2 and a really old version of pandas

furthermore, that issue sounds like it's more about renaming a confusing freq?

I suppose we could add ME, etc., for month end, but changing the offset M from month-end to month-start seems like a non-starter. Ugh. So I guess we're left with a documentation issue (#5023), unless we want to add a hack for resample.

I've opened #51710 as warning/erroring is a different activity

I'll have a think, but I agree something should be done as this is really confusing

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Resample resample method
Projects
None yet
Development

No branches or pull requests

8 participants