ENH: Support dateutil timezones. GH4688. #6968
Conversation
can you provide a vbench and its results (there may already be something in there), but if not, pls add; essentially working with Timestamp/DatetimeIndex with a timezone for pytz & dateutil would be useful |
I've tried running the vbench tests but they fail part way through with an error about checksums not being unique (output below). This happens with both the current pandas master and my branch. It looks like there are duplicate tests in the test suite. The error was: sqlalchemy.exc.IntegrityError: (IntegrityError) column checksum is not unique u'INSERT INTO benchmarks (checksum, name, description) VALUES (?, ?, ?)' ('ea1993ef61c3cc4e871d2cce3c5d983c', 'eval_frame_chained_cmp_python', None). I hacked around in eval.py a bit and got past this error by commenting out the duplicated tests.
I'm not sure where to go to fix this (and I'm wary of changing the tests too much!). Any suggestions? |
are you sure you are rebased to master? you are running |
That's the command I was running. I was rebased to master when I submitted the PR; I'll be a little bit behind now. I've rebased to master again and I still get the checksum/duplicate test error. If I comment out the duplicated tests I can get past it. If being rebased to master is a requirement for running the vbench tests, it would be great to add it to the wiki page (https://github.com/pydata/pandas/wiki/Performance-Testing). |
its not, I am just puzzled by your error. |
@cpcloud did we have an issue with the |
ah ok. Rebasing seems to have fixed things anyway. Results below. I'm a bit surprised that some things were 2x faster on the head than the base; possibly some uneven load on the node I was using. Ditto for the things that are 1.5x slower.
|
FYI, use triple-quotes around the text to format it. Take the vbench results with a grain of salt...disk/cpu load can affect them. If I find something that is consistently worse/better then I would just run it individually (in ipython); things < 1ms are sort of randomish anyhow. You can also run a subset of tests with |
which of the tail tests (better or worse results) are affected by this? (maybe rerun those as a subset) |
I'll have a go tomorrow when I can run the tests locally and see if it's a bit more consistent. Thanks for the tip on the text. |
I've run the tests locally and the results are much the same. The really slow one is an anomaly; I reran the tests and it was the same speed.
|
ok, pls rebase to master and then rerun. some of these outliers have already been fixed. need to narrow down your changes. |
Updated results:
|
you get pretty variable results; maybe you are running other things at the same time? I usually try to run vbench when nothing else is running (other load especially mucks up the disk i/o ones). |
I'll set it going this evening when I leave the office so that it's uninterrupted. |
Sorry for the delay with the test figures, other stuff took over at work. I've got the vbench suite running on a dedicated machine now and I've run the tests a couple of times to get a consistent result. The table below shows test name, individual results, baseline test time and mean result. It looks like there are just a few tests that are significantly slower than the base. Let me know what you think.
|
looks ok, maybe investigate the very last test a bit (dti_reset_index_tz). Generally a bench that is, say, < 1ms is really hard to measure as lots of things can throw it off (e.g. why the 2nd to last one is no biggie) |
There were a few functions in tslib causing the slowdown - stuff which could have been better cythoned. I've tweaked a few things to keep a bit more in cython and the performance looks better. I've broken two seemingly unrelated tests though.
If that's good enough performance-wise then I'll fix the failing tests and push my changes tomorrow. |
looks good |
I've rebased to master and fixed the merge. I can squash all of the commits once we're happy with the code and get the tests passing... There are some tests failing on Travis in python 2.6. This seems to be a difference in numpy (e.g. fails in 1.6.1 but not 1.8.0 or our internally patched 1.7.1 on both python 2.6 and python 2.7). I think it'll be very tricky to work out why there's a difference though - the problem will lie in tslib.tz_localize_to_utc which is quite complicated. |
ok, numpy < 1.7 is pretty much completely broken for timedelta64. |
Does that apply here? We're passing in np.array of int64s not dates, and I think all of the infer_dst logic stays in the int64 world, avoiding timedelta64... |
not sure what the error is; just that < 1.7 has lots of issues (not so many with datetimes though) |
Some more digging and this appears to be the problem:

```python
# numpy 1.6
np.array(10, dtype='M8[s]').astype('M8[ns]').view('i8') == 10
# numpy 1.7
np.array(10, dtype='M8[s]').astype('M8[ns]').view('i8') == 10000000000
```

This is in tslib._get_transitions for dateutil timezones. Multiplying by 1e9 in numpy < 1.7 has fixed the tests. I've also rebased to master and squashed everything into one commit. |
yep |
ah ok. I like the clarity ("convert from seconds to nanoseconds") and numpy 1.6 will be retired eventually. Is there anything else I need to do before this is merged into the master branch? |
Enhancements
~~~~~~~~~~~~

- Support for dateutil timezones, which can now be used in the same way as
put an issue link here (to the original issue)
Will do.
Have rebased to master; tests pass locally. The performance test was fine, no changes from before. The travis build seems to be taking a while though... |
travis turn around time is pretty random, I think it depends on their entire system load. sometimes pretty quick, sometimes not. |
@jorisvandenbossche @cpcloud pls have a look if you can |
    return hasattr(tz, '_trans_list') and hasattr(tz, '_trans_idx')


cdef object _tz_cache_key(object tz):
this can be inline
(in general a cdef function that is 'short' should/can be inline)
Done.
travis very slow today... https://travis-ci.org/pydata/pandas/builds |
```diff
@@ -2,6 +2,8 @@
 from datetime import datetime, time, timedelta, tzinfo, date
 import sys
 import os
+import unittest
+import itertools
 import nose
```
Don't think you need the itertools import here.
@dbew few minor comments but looks good otherwise. I would use cdef (instead of cpdef) for functions that are not called from Python. That way Cython won't generate Python C API calls until absolutely necessary. |
Which usually happens at function and class entry and exit points, FYI |
```diff
-def _is_tzlocal(tz):
-    return isinstance(tz, tzlocal)
+cdef bint _is_tzlocal(object tz):
```
pretty sure you can inline this
Adds support for dateutil timezones across pandas. Supports conversion between dateutil and pytz timezones. Ensures timezone functionality is tested with both dateutil and pytz.
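A minimal sketch of what this enables: a dateutil tzinfo object can be passed anywhere a pytz zone would be. The fixed-offset zone below is used only so the example needs no system timezone database, and the 'EST' label is just illustrative:

```python
import pandas as pd
from dateutil.tz import tzutc, tzoffset

# Build a UTC timestamp with a dateutil zone instead of a pytz one
ts = pd.Timestamp('2014-01-01 12:00', tz=tzutc())

# Convert to a fixed UTC-5 dateutil zone (stand-in for US Eastern winter time)
est = tzoffset('EST', -5 * 3600)
local = ts.tz_convert(est)
print(local.hour)  # 7
```

The same calls with `tz='UTC'` and a pytz zone should produce identical results, which is the "seamless" behaviour this PR aims for.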
great ping me on green |
@cpcloud @jorisvandenbossche any more comments? |
I was using a pep8 code checker in pydev while I was writing the code so it should be reasonably pep8 compliant. I'll take a look at pep8radius. It'd be good to get this into the master branch though... |
Ok, I'll give pep8radius a go soon. |
@dbew ping when you are ready for this (after/if anything needs peping) |
@jreback All ready. @jorisvandenbossche was happy with using the pydev pep8 checker. I'm not going to get chance to look at pep8radius today. |
ENH: Support dateutil timezones. GH4688.
thanks @dbew hopefully no unexpected fallout from this... but I think you did a great job shepherding this! |
No problem, it's been interesting. It's something that we'll be using inside AHLMSS so we'll keep an eye out for issues. |
closes #4688
This PR should provide support for dateutil timezones. It was discussed quite a bit in #4689.
The discussion there ended two months back with "we should incorporate this if it can be made seamless". I think I've managed that now so it would be great to get some feedback on the changes in this PR.
Everything that works with pytz timezones should work with dateutil timezones and you shouldn't notice any difference in behaviour when you change between them. There are two exceptions which I'm looking at now: saving data to ujson and pytables.
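As a hedged illustration of that interchangeability claim, a DatetimeIndex can be localized to and converted from a dateutil zone exactly as with pytz (again using a fixed-offset dateutil zone so the example is self-contained; the 'UTC+2' label is arbitrary):

```python
import pandas as pd
from dateutil.tz import tzoffset

tz = tzoffset('UTC+2', 2 * 3600)  # illustrative fixed-offset dateutil zone
idx = pd.date_range('2014-06-01', periods=3, freq='D').tz_localize(tz)
back = idx.tz_convert('UTC')
print(back[0].hour)  # 22  (midnight at UTC+2 is 22:00 the previous day, UTC)
```

Swapping `tz` for the equivalent pytz object should leave every value unchanged.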
The changes I've made just allow pandas to treat dateutil timezones exactly the same as pytz timezones (from the user perspective) - just extending conversion logic to deal with both where appropriate. This has made a few methods a bit more complicated but relatively few changes were required.
All of the significant changes are in tslib.pyx. Almost all of the other changes are adding test cases for the dateutil timezones. The test suites pass locally.
Let me know if you have any questions or if there's anything that needs doing before the PR can be accepted.