
Can't use dt.ceil, dt.floor, dt.round with coarser target frequencies #15303

Open
rosnfeld opened this issue Feb 3, 2017 · 11 comments
Labels
Datetime (Datetime data dtype), Enhancement, Frequency (DateOffsets)

Comments

@rosnfeld
Contributor

rosnfeld commented Feb 3, 2017

Code Sample

This works fine:

In [16]: hours = pd.Series(pd.date_range('2017-01-01', '2017-01-31', freq='H'))

In [17]: hours.dt.ceil('D')
Out[17]: 
0     2017-01-01
1     2017-01-02
2     2017-01-02
3     2017-01-02
4     2017-01-02
...

But this doesn't:

In [18]: days = pd.Series(pd.date_range('2017-01-01', '2017-01-31', freq='D'))

In [19]: days.dt.ceil('M')
...
/home/andrew/git/pandas-rosnfeld-py3/pandas/tseries/offsets.py in nanos(self)
    510     @property
    511     def nanos(self):
--> 512         raise ValueError("{0} is a non-fixed frequency".format(self))
    513 
    514 

ValueError: <MonthEnd> is a non-fixed frequency

Problem description

I haven't done an exhaustive test, but it seems that one can't use coarser frequencies like 'M', 'Q', 'A' in the dt.ceil/floor/round set of methods.

Expected Output

It would seem reasonable to be able to take a series of scattered dates and "coarsen" them into month-ends, quarter-ends, year-ends, etc.
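
The kind of output I have in mind would be something like this (hypothetical, since ceil('M') currently raises):

In [20]: days.dt.ceil('M')   # hypothetical output
Out[20]:
0   2017-01-31
1   2017-01-31
2   2017-01-31
3   2017-01-31
4   2017-01-31
...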

Output of pd.show_versions()

(admittedly some pretty old stuff in this environment, but I also got the same behavior on 0.19.2 with the latest of everything else)

INSTALLED VERSIONS
------------------
commit: 74e20a0
python: 3.5.1.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.0-62-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8
LOCALE: en_GB.UTF-8

pandas: 0.19.0+300.g74e20a0.dirty
nose: 1.3.7
pip: 7.1.2
setuptools: 19.1.1
Cython: 0.23.4
numpy: 1.10.2
scipy: 0.16.1
statsmodels: None
xarray: None
IPython: 4.0.1
sphinx: 1.3.1
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2016.4
blosc: None
bottleneck: 1.0.0
tables: 3.2.2
numexpr: 2.4.6
feather: None
matplotlib: 1.5.1
openpyxl: 2.3.2
xlrd: 0.9.4
xlwt: 1.0.0
xlsxwriter: 0.8.2
lxml: 3.5.0
bs4: None
html5lib: 0.999
httplib2: None
apiclient: None
sqlalchemy: 1.0.11
pymysql: 0.6.7.None
psycopg2: None
jinja2: 2.8
s3fs: None
pandas_datareader: None

@jreback
Contributor

jreback commented Feb 4, 2017

Here is an alternative, and probably more correct, approach.

In [3]: s.dt.to_period('M').dt.to_timestamp()
Out[3]: 
DatetimeIndex(['2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01',
               ...
               '2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01', '2017-01-01'],
              dtype='datetime64[ns]', length=721, freq=None)

The issue is that rounding is pretty tricky, and I would argue unexpected, for months: you can easily have the 15th round up in some months and down in others. You could argue that .ceil/.floor are unambiguous, though.
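
For example, with a hypothetical round('M') to the nearest month end, the 15th falls on different sides of the midpoint depending on the month length:

In [4]: pd.Timestamp('2017-01-31') - pd.Timestamp('2017-01-15')
Out[4]: Timedelta('16 days 00:00:00')   # farther than the 15 days back to 2016-12-31, so January 15 would round down

In [5]: pd.Timestamp('2017-02-28') - pd.Timestamp('2017-02-15')
Out[5]: Timedelta('13 days 00:00:00')   # closer than the 15 days back to 2017-01-31, so February 15 would round up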

If you want to implement this, something reasonable would be fine (you could probably just do the .to_period transform under the hood).
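
A rough sketch of one possible to_period-based mapping (not a finished design): floor-like via the period start, ceil-like via the normalized period end.

In [6]: days = pd.Series(pd.date_range('2017-01-01', '2017-01-31', freq='D'))

In [7]: days.dt.to_period('M').dt.to_timestamp()            # floor-like: month start

In [8]: days.dt.to_period('M').dt.end_time.dt.normalize()   # ceil-like: month end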

jreback added the Difficulty Intermediate, Frequency (DateOffsets), and Datetime (Datetime data dtype) labels on Feb 4, 2017
jreback added this to the Next Major Release milestone on Feb 4, 2017
@rosnfeld
Contributor Author

Taking a look at this: if the date is '2017-01-23 12:34:56', what should floor and ceil return for a frequency like 'M' (MonthEnd)? If you're doing .dt.to_period('M').dt.to_timestamp(), then you get '2017-01-01' - but that's not a MonthEnd value. Similarly if you took the end_time of that period, you'd get '2017-01-31 23:59:59.999999999', also not a MonthEnd. I think you actually want '2016-12-31 00:00:00' (for floor) and '2017-01-31 00:00:00' (for ceil) - right? So I am not sure that to_period is the key here.
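
For reference, offset rollback/rollforward on a normalized Timestamp produce exactly those values (a quick check, not part of the patch):

import pandas as pd

ts = pd.Timestamp('2017-01-23 12:34:56').normalize()
pd.offsets.MonthEnd().rollback(ts)      # Timestamp('2016-12-31 00:00:00')
pd.offsets.MonthEnd().rollforward(ts)   # Timestamp('2017-01-31 00:00:00')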

While I wrote this issue for Series, upon playing with this further I've noticed that it also affects DatetimeIndex and individual Timestamp values. I have a fix for individual values (e.g. pd.to_datetime('2017-01-23 12:34:56').ceil('M') now behaves as I think is right in the paragraph above), but I'm actually not sure how to fix the Series/DatetimeIndex case.

My patch currently affects the Timestamp._round() code in tslib.pyx, and goes something like the following, using the offset rollforward/rollback to stay on-offset for at least the floor and ceil cases:

        from pandas.tseries.frequencies import to_offset
        from pandas.tseries.offsets import Tick

        offset = to_offset(freq)

        if not isinstance(offset, Tick):
            # Non-fixed frequency: normalize so the result lands on midnight,
            # then roll onto the offset rather than doing nanosecond arithmetic.
            offset.normalize = True

            if rounder == np.ceil:
                return offset.rollforward(self)

            if rounder == np.floor:
                return offset.rollback(self)

            # .round() stays unsupported for non-fixed frequencies.
            raise ValueError('Can only round to Tick offsets')

There is a very similar TimelikeOps._round() method in datetimelike.py, but I can't do the same change there, as the offset rollforward/rollback methods only work on Timestamp. Suggestions are welcome - not sure how one normally "scales up" more complicated individual operations like this to Series in a performant way. Also not sure if there is a cleaner/more obvious way to do what I've done in the single Timestamp case.
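
The only Series-level version I can see right now is element-wise, which is correct but not vectorized (just a sketch):

import pandas as pd

days = pd.Series(pd.date_range('2017-01-01', '2017-01-31', freq='D'))
month_end = pd.offsets.MonthEnd()
days.dt.normalize().map(month_end.rollforward)   # every value becomes 2017-01-31; Jan 31 itself is already on the offset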

@jreback
Contributor

jreback commented Sep 6, 2017

@rosnfeld

This is how you work with dti/Series; it's very similar.

In [10]: i = pd.to_datetime(['2017-01-23 12:34:56'])

In [11]: i
Out[11]: DatetimeIndex(['2017-01-23 12:34:56'], dtype='datetime64[ns]', freq=None)

In [12]: i.normalize() + pd.offsets.MonthBegin()
Out[12]: DatetimeIndex(['2017-02-01'], dtype='datetime64[ns]', freq=None)

In [13]: i.normalize() + pd.offsets.MonthEnd()
Out[13]: DatetimeIndex(['2017-01-31'], dtype='datetime64[ns]', freq=None)

@rosnfeld
Contributor Author

(got motivated to come back to this issue as I hit it again...)

Thanks @jreback - I am very close now. The only issue is that adding the offset isn't idempotent/doesn't handle the exact boundary properly. That is, in the single Timestamp case, timestamp.ceil('M').ceil('M'), the second application correctly does nothing, but in the Series case, series.dt.ceil('M').dt.ceil('M') rolls the month forward again, since it just adds another offset.

offset.rollforward(timestamp) is smart enough to call offset.onOffset(timestamp) to catch this case. I guess I need a Series equivalent that somehow incorporates this kind of logic... but it may be tricky to do in an efficient way.

@batterseapower
Contributor

If you just want to roll forward, then i.normalize() + 0 * pd.offsets.MonthBegin() is an idempotent operation.
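
A quick check of the idempotence (illustrative only):

import pandas as pd

i = pd.to_datetime(['2017-01-23 12:34:56', '2017-02-01'])
i.normalize() + 0 * pd.offsets.MonthBegin()
# DatetimeIndex(['2017-02-01', '2017-02-01'], dtype='datetime64[ns]', freq=None)
# '2017-02-01' is already on the offset, so a second application leaves it unchanged.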

@Safrone

Safrone commented Jan 7, 2021

I'm still interested in this; I can't find a reasonable way to snap a datetime to a week beginning.

The best I can do is:

import pandas as pd

end_time = pd.Timestamp('12/15/2020 9:24:00', tz='US/Eastern')

for interval in ['W-SUN', 'H', 'T']:
    if interval == 'W-SUN':
        # .floor('W-SUN') raises, so snap a one-element DatetimeIndex instead
        query_end_time = pd.DatetimeIndex([end_time.normalize()]).snap(pd.offsets.Week(weekday=6))[0]
    else:
        query_end_time = end_time.floor(interval)

.floor used to work for 'W-SUN'

@jpacostar

This worked for me :)

hours.dt.to_period('W-SUN').dt.start_time
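
For instance, on a tz-naive version of the timestamp from the earlier comment:

import pandas as pd

s = pd.Series([pd.Timestamp('2020-12-15 09:24:00')])
s.dt.to_period('W-SUN').dt.start_time
# 0   2020-12-14
# dtype: datetime64[ns]
# 'W-SUN' weeks run Monday through Sunday, so start_time is the preceding Monday.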

@xuancong84

It is so surprising that this ticket has been opened in 2017 and has not been fixed in 4 years. It is very obvious that this is due to incompetent programming nature when it comes to grouping dates into different buckets. As I mentioned in #43093, the internal logic inconsistency of Pandas will only make these kinds of defects harder to rectify.

If grouping by years or months cannot be done easily, I have no objection: a year has ~365.2425 days, so the number of days in a year, and thus in a month, cannot be determined in a straightforward manner; you need to account for leap years (every 4 years, except that years divisible by 100 but not by 400 are not leap years). However, the number of days in a week is always 7 and is not affected by leap years, so this can be calculated with a simple MOD operation on the raw timestamp. I am not sure why pd.Timestamp.floor('W'), or 'W-SUN'/'W-MON'/etc., still does not work today.
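
As a standalone illustration of that MOD idea (not how pandas implements anything; it relies only on the Unix epoch, 1970-01-01, being a Thursday, and assumes tz-naive timestamps):

import pandas as pd

DAY = 86_400_000_000_000        # nanoseconds per day
WEEK = 7 * DAY
MONDAY_SHIFT = 3 * DAY          # the epoch was a Thursday, 3 days after a Monday

def floor_to_monday(ts: pd.Timestamp) -> pd.Timestamp:
    ns = ts.value               # nanoseconds since the epoch
    return pd.Timestamp(ns - (ns + MONDAY_SHIFT) % WEEK)

floor_to_monday(pd.Timestamp('2020-12-15 09:24:00'))
# Timestamp('2020-12-14 00:00:00')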

@jreback
Contributor

jreback commented Aug 19, 2021

@xuancong84

Statements like this

It is very obvious that this is due to incompetent programming nature when it comes to grouping dates into different buckets

will get you banned; please follow the code of conduct.

Constructive criticism and patches are welcome.

@MarcoGorelli
Member

MarcoGorelli commented Aug 19, 2021

@xuancong84 this is not the first time you have made such comments:

So do we have another general inconsistency due to incompetent programming?

#32263 (comment)

Thus, incompetent programming and inherently deficient/defective architectural design does not allow this bug to be solved that easily ^_^

#33202

-1 on @jreback , you seem to view everything as negative and simply refuse to implement anything useful to the general public, your attitude is very negative and arrogant. Okay, so be it.

#33738

So, please take the above as a final warning - thanks

@xuancong84

xuancong84 commented Aug 20, 2021

It is so surprising that this ticket has been opened in 2017 and has not been fixed in 4 years. It is very obvious that this is due to incompetent programming nature when it comes to grouping dates into different buckets. As I mentioned in #43093, the internal logic inconsistency of Pandas will only make these kinds of defects harder to rectify.

I do apologize that my comments may be a bit harsh for you guys. But why do I make such harsh comments only in your repo, and not in anyone else's? I hope you can focus more of your energy on fixing these bugs rather than on the people who complain about them.

I will make no further harsh comments on the pandas repo. On this issue, may I know which part of the source code needs to be changed in order to fix the bug?

mroeschke removed this from the Contributions Welcome milestone on Oct 13, 2022