Skip to content

BUG: DateOffset weekday around DST produces unexpected results. fixes #5175 #5292

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Sep 4, 2014

Conversation

adamgreenhall
Copy link
Contributor

closes #5175

Wound up working on this in preparation for the DST change next week. Here's a first pass at a solution and a test.

@jreback
Copy link
Contributor

jreback commented Oct 22, 2013

the failed build is python 2.6; click on the travis link, then the failed one, scroll down and you can see the test failures.

timedelta doesn't have the total_seconds attribute in 2.6, so you can prob just change the test to compute it

e.g. instead of t.total_seconds()

use t.hour_3600+t.minute_60+t.seconds....which will work under all python versions

@jreback
Copy link
Contributor

jreback commented Mar 9, 2014

@adamgreenhall can you rebase

how's this coming?

@jreback
Copy link
Contributor

jreback commented Apr 9, 2014

@adamgreenhall can you rebase this?

@jreback jreback modified the milestones: 0.15.0, 0.14.0 Apr 22, 2014
@jreback
Copy link
Contributor

jreback commented Apr 22, 2014

@adamgreenhall can you rebase?

@adamgreenhall
Copy link
Contributor Author

@jreback - sorry this took so long to get around to - will try to get this done this week. Current status: rebased but many of the offsets tests are failing.

Some of the tests unrelated to DST are failing because of a years kwd. Any ideas? Example:

ERROR: test_calendar (pandas.tseries.tests.test_offsets.TestCustomBusinessDay)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/adam/code/pandas/pandas/tseries/tests/test_offsets.py", line 596, in test_calendar
    assertEq(CDay(calendar=calendar), dt, datetime(2014, 1, 21))
  File "/Users/adam/code/pandas/pandas/tseries/offsets.py", line 504, in __init__
    holidays = kwds['calendar'].holidays()
  File "/Users/adam/code/pandas/pandas/tseries/holiday.py", line 245, in holidays
    rule_holidays = rule.dates(start, end, return_name=True)
  File "/Users/adam/code/pandas/pandas/tseries/holiday.py", line 119, in dates
    year_offset = DateOffset(years=1)
  File "/Users/adam/code/pandas/pandas/tseries/offsets.py", line 124, in __init__
    self._offset = timedelta(**kwds)
TypeError: 'years' is an invalid keyword argument for this function

@adamgreenhall
Copy link
Contributor Author

ok - have my tests on DST passing - just the year offsets are failing now.

@@ -105,12 +105,23 @@ def __add__(date):
"""
_cacheable = False
_normalize_cache = True
_kwds_use_relativedelta = (
'months', 'weeks', 'days',
'month', 'week', 'day', 'weekday',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you need to add year/years here I think (they are keywords for relativedelta not timedelta)
but then I am confused why you are doing this at all? is relativedelta better at DST calcs? the problem is it is MUCH slower

@adamgreenhall
Copy link
Contributor Author

@jreback - tests are passing - all looks good.

@jreback
Copy link
Contributor

jreback commented Apr 29, 2014

can you check performance on this?

test.perf.sh -b master -t HEAD and post anything >1.2 ratio

also need a release note in Bug Fix section

@jreback jreback modified the milestones: 0.14.0, 0.15.0 Apr 29, 2014
and get_utc_offset_hours(t) == -7)


class TestDST(tm.TestCase):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually I think you need to put ALL of these (TestDST) tests into the TestBase object so it will run them for each frequency (you may want to exclude certain of the estoeric ones and/or ones which it does affect); alternatively you can make TestDST inherit from object and then just include that class in the offsets that matter.

The reason I think you need to do this is you are adding logic to apply, but a lot of the offsets override apply and might not call your code. (so its possible we need a common method that ALL offsets call or something).

@adamgreenhall
Copy link
Contributor Author

Will work on the tests tomorrow.

Looks like a few of the performance metrics are taking longer. I imagine that frame_ctor_dtindex_BDay and frame_ctor_dtindex_BMonthEnd are definitely related to the changes. Likely this line.

groupby_first_float32                        |   5.6330 |   4.5720 |   1.2321 |
groupby_last                                 |   5.7027 |   4.5710 |   1.2476 |
frame_ctor_dtindex_BMonthEnd(2)              |   1.7046 |   1.3630 |   1.2506 |
replace_fillna                               |   2.6340 |   2.0750 |   1.2694 |
index_from_series_ctor                       |   0.0297 |   0.0233 |   1.2765 |
datetime_index_intersection                  |   0.4674 |   0.3633 |   1.2863 |
series_drop_duplicates_int                   |   1.1303 |   0.8393 |   1.3467 |
frame_reindex_axis1                          | 240.2460 | 177.1216 |   1.3564 |
stats_rolling_mean                           |   1.2903 |   0.9397 |   1.3731 |
groupby_multi_different_functions            |  19.0346 |  13.5396 |   1.4058 |
frame_ctor_dtindex_BDay(2)                   |   1.9636 |   1.3007 |   1.5097 |
frame_boolean_row_select                     |   0.6274 |   0.3480 |   1.8027 |
frame_from_records_generator_nrows           |   1.9564 |   1.0613 |   1.8434 |

@jreback
Copy link
Contributor

jreback commented Apr 30, 2014

yeah relativedelta is slow (that's why its not used unless you readlly need it).

I would refactor this to make the setup method callable, and then only call it when you really need to.

@adamgreenhall
Copy link
Contributor Author

Is it relativedelta that is slow? that is what is being used on master for all computations. Or is it just the additional timezone conversion?

@jreback
Copy link
Contributor

jreback commented May 1, 2014

iirc relativedelta is only used for month/year offsets

not day or less

@adamgreenhall
Copy link
Contributor Author

It was actually the timezone conversion that was taking a long time. I refactored a bit and now have things running faster (I think).

But one of the custom date offset tests is now failing:

FAIL: test_datetimindex (pandas.tseries.tests.test_offsets.TestCustomBusinessMonthEnd)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/adam/code/pandas/pandas/tseries/tests/test_offsets.py", line 754, in test_datetimindex
    datetime(2012,4,30))
AssertionError: Timestamp('2012-05-31 00:00:00', offset='CBM') != datetime.datetime(2012, 4, 30, 0, 0)

Any ideas? I'm not sure exactly what is going on in this test.

Also, the test_perf.sh script is returning inconsistent results between runs. How do you decide what increases are real and because of the changes you made?

Run 1 (>1.2x):

datetime_index_intersection                  |   0.6840 |   0.5631 |   1.2148 |
frame_xs_row                                 |   0.0550 |   0.0439 |   1.2514 |
groupby_sum_booleans                         |   1.8830 |   1.4087 |   1.3368 |
stats_rolling_mean                           |   1.2867 |   0.9611 |   1.3388 |
timeseries_large_lookup_value                |   0.0463 |   0.0320 |   1.4467 |
frame_object_unequal                         |   0.0126 |   0.0076 |   1.6562 |

Run 2 (>1.2x):

dataframe_resample_mean_string               |   2.7219 |   2.2534 |   1.2079 |
frame_assign_timeseries_index                |   1.2950 |   1.0667 |   1.2141 |
timeseries_large_lookup_value                |   0.0473 |   0.0387 |   1.2218 |
frame_reindex_columns                        |   0.5391 |   0.4320 |   1.2478 |
frame_xs_mi_ix                               |   0.5357 |   0.4030 |   1.3293 |
reindex_multiindex                           |   2.9380 |   2.2040 |   1.3330 |
panel_shift                                  |   0.1361 |   0.0920 |   1.4784 |
frame_object_unequal                         |   0.0120 |   0.0080 |   1.4950 |
frame_nonunique_unequal                      |   0.0176 |   0.0116 |   1.5205 |
frame_ctor_dtindex_QuarterEnd(2)             |   1.0840 |   0.6154 |   1.7616 |
groupby_sum_booleans                         |   2.2246 |   1.2584 |   1.7678 |

@jreback
Copy link
Contributor

jreback commented May 2, 2014

rebase on master, that test is fixed

@adamgreenhall
Copy link
Contributor Author

thanks - rebased and all tests passing.

@@ -105,25 +105,49 @@ def __add__(date):
"""
_cacheable = False
_normalize_cache = True
_kwds_use_relativedelta = (
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it seems that you would use relativedelta for everything (except nanoseconds)? is that right? how does this solve the problem?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

relativedelta is used for all singluar offsets and plural offsets greater than daily frequency. With this change these calculations are done in UTC.

Also with this change, timedelta is now used for all plural offsets (hours, minutes, seconds) under daily frequency.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so why are minutes,hours,seconds,microseconds here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mintue, hour, second, ... -- singular not plural. Because you can do:

In [7]: pd.Timestamp("2013-11-02 00:45") + pd.DateOffset(minute=30)
Out[7]: Timestamp('2013-11-02 00:30:00', tz=None)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why wouldn't you do ALL calculations then using relativedelta? why is timedelta not good?
shouldn't you ALWAYS do them in UTC (e.g. convert to/from utc if their is a timezone, then back)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why don't you just tranlsate the plurals then, isn't it simpler?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

plurals and singular are not the same -- plurals behave how you would expect, but singulars are used for fixing the time part:

In [2]: t = pd.Timestamp('2014-06-10 8am')

In [3]: t + pd.DateOffset(hour=9)
Out[3]: Timestamp('2014-06-10 09:00:00', tz=None)

In [4]: t + pd.DateOffset(hours=9)
Out[4]: Timestamp('2014-06-10 17:00:00', tz=None)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems like a bug to me, eg. t + pd.offsets.Hour(9) == t + pd.DateOffset(hours=9) == t + timedelta(hours=9) (and the first is prob how most users do this). Are their tests for the former anywhere? can you point them out.

Is this documented anywhere?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't find any tests or documentation for the singular form anywhere in pandas. However, I think this behavior is inherited from dateutil.relativedelta, which mentions it in passing in the documentation:

year, month, day, hour, minute, second, microsecond
    Absolute information.
years, months, weeks, days, hours, minutes, seconds, microseconds
    Relative information, may be negative.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can u put that link on the linked issue #7418

@jreback
Copy link
Contributor

jreback commented May 2, 2014

so again, why are you not using hours/minutes/seconds etc using timedelta?

@jreback
Copy link
Contributor

jreback commented May 2, 2014

you need to move the tests as I indicated above, so that all offsets are tested. The way you have it only a very limited number are.

@adamgreenhall
Copy link
Contributor Author

hours/minutes/seconds/... now use timedelta

In [2]: pd.DateOffset(hours=3)._offset
Out[2]: datetime.timedelta(0, 10800)

I'll fix those tests.

@jreback jreback modified the milestones: 0.14.1, 0.14.0 May 5, 2014
return (o.days * 24 * 3600 + o.seconds) / 3600.0


class TestDST(tm.TestCase):
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback - want to have a look at these test cases? Went with a limited set of all of the possible DateOffsets

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be a sub-class of object, then use it to inherit in offsets that should test this

e.g.

class TestBusinessDay(TestDST, Base)

@jreback jreback modified the milestones: 0.15.0, 0.15.1 Jul 6, 2014
@jreback
Copy link
Contributor

jreback commented Jul 21, 2014

@adamgreenhall I know we kept pushing this....sorry about that...

can you rebase and can revist (I think let's just put it in regardless of #7418)

cc @rockg
cc @sinhrks


if tzinfo is not None and self._use_relativedelta:
# bring tz back from UTC calculation
other = tzinfo.localize(other)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After #7465, reverts tz back is being performed in apply_wraps (generalized version of apply_nat).

And tzinfo.localize raises AttributeError for dateutil timezone. #7697 is going to fix this in apply_wraps.

@jreback
Copy link
Contributor

jreback commented Aug 5, 2014

@adamgreenhall can you rebase?

@adamgreenhall
Copy link
Contributor Author

@jreback - rebased.

@jreback
Copy link
Contributor

jreback commented Aug 5, 2014

thanks. can you add a release note in v0.15.0.txt. this is a bug fix? and pls squash

@jreback
Copy link
Contributor

jreback commented Aug 5, 2014

@rockg @sinhrks pls have a look


if tzinfo is not None and self._use_relativedelta:
# bring tz back from UTC calculation
other = tzinfo.localize(other)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think this isn't required as apply_wraps does. Otherwise, use tslib._localize_pydatetime to support dateutil, and better to have a test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you do need the tslib._localize_pydatetime. Tested in test_offsets.test_fallback_singular, which fails without it.

@adamgreenhall
Copy link
Contributor Author

@jreback @sinhrks @rockg - addressed comments, added some tests, added to release notes, rebased, and squashed. Want to have another look?

@adamgreenhall
Copy link
Contributor Author

fixed some py2.6 and py3 compatibility issues, rerunning tests.

@adamgreenhall
Copy link
Contributor Author

I was having some issues with nanoseconds in other versions of python/numpy. I wound up taking nanoseconds out of the DST tests (in hindsight it seemed like overkill to have them there). All tests are passing now. @jreback @sinhrks - can you take a look?

offset = DateOffset(**{offset_name: offset_n})
t = tstart + offset
if expected_utc_offset is not None:
assert(get_utc_offset_hours(t) == expected_utc_offset)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't use any bare asserts, instead use self.assertTrue (or False)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@adamgreenhall adamgreenhall force-pushed the dst-date-offset branch 2 times, most recently from 2b36e9d to 2a3bdce Compare August 19, 2014 20:07

if len(kwds_no_nanos) > 0:
if any(k in self._kwds_use_relativedelta for k in kwds_no_nanos):
self._use_relativedelta = True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

instead of using this variable, why don't you make this a cached_readonly property that tests isinstance(self._offset,relativedelta), much more intuitive I think

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or you can use that kw, but return it with the offset. What you are doing here is really odd.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure how the cached_readonly property would work - that's not a pattern I'm familiar with (is there an example handy?). I'll return the return _use_relativedelta.

@adamgreenhall
Copy link
Contributor Author

any updates on this? need anything from me?

@jreback jreback merged commit 2a42334 into pandas-dev:master Sep 4, 2014
@jreback
Copy link
Contributor

jreback commented Sep 4, 2014

@adamgreenhall merged! this has been around a while! thanks!

@adamgreenhall
Copy link
Contributor Author

😸

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Timezones Timezone data dtype
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: DateOffset weekday around DST produces unexpected results
3 participants