-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
PERF: leverage tzlocal package to provide 2000x speedup for dateutil.tz.tzlocal operations #24737
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Codecov Report
@@ Coverage Diff @@
## master #24737 +/- ##
==========================================
- Coverage 92.39% 92.38% -0.01%
==========================================
Files 166 166
Lines 52358 52358
==========================================
- Hits 48374 48373 -1
- Misses 3984 3985 +1
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #24737 +/- ##
===========================================
- Coverage 92.37% 42.75% -49.62%
===========================================
Files 166 171 +5
Lines 52408 52582 +174
===========================================
- Hits 48412 22484 -25928
- Misses 3996 30098 +26102
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall we need to consider whether this is worth bumping up our min dateutil version and adding an optional dependency to accelerate only dateutil.tz.tzlocal
Alternatively, could tzlocal
be vendored?
ci/deps/azure-27-compat.yaml
Outdated
@@ -10,10 +10,11 @@ dependencies: | |||
- numpy=1.12.0 | |||
- openpyxl=2.5.5 | |||
- pytables=3.4.2 | |||
- python-dateutil=2.5.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We would need to discuss bumping up our minimum version of dateutil
pandas/_libs/tslibs/timezones.pyx
Outdated
@@ -34,6 +36,38 @@ cdef inline bint is_tzlocal(object tz): | |||
return isinstance(tz, _dateutil_tzlocal) | |||
|
|||
|
|||
cpdef bint have_tzlocal_package(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't this just be run once with the imports and just assign the sentinel value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Absolutely
pandas/_libs/tslibs/timezones.pyx
Outdated
|
||
try: | ||
return pytz.timezone(tzlocal_tz) | ||
except pytz.exceptions.UnknownTimeZoneError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't think you need the exceptions
here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tzlocal
will report the system tz even if it isn't a valid pytz
tz. For example, we'll hit this case if /etc/timezone
contains foo/bar
or, more likely, an ambiguous user timezone like STD
.
@@ -820,7 +820,8 @@ def test_timetz_accessor(self, tz_naive_fixture): | |||
|
|||
index = DatetimeIndex(['2018-06-04 10:20:30', pd.NaT], tz=tz) | |||
result = index.timetz | |||
|
|||
print('result', result) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leftover prints?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, will clean up
@@ -536,6 +536,8 @@ def test_dt_timetz_accessor(self, tz_naive_fixture): | |||
expected = Series([time(23, 56, tzinfo=tz), time(21, 24, tzinfo=tz), | |||
time(22, 14, tzinfo=tz)]) | |||
result = s.dt.timetz | |||
print('result', result) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leftover prints?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we just moved min version for python dateutil
how hard is it to vendor tzlocal?
@mroeschke Agreed, it's worth mentioning that this mostly fixes the top performance regression between
It'd be relatively simple to vendor it as it's only a handful of files. It's MIT licensed, so should be fine but worth mentioning now. I haven't been able to reproduce the |
if we bump to 2.6.0 would u still need tzlocal? |
I generally like the idea of avoiding returning Do we have any read on how accurate/canonical tzlocal.get_localzone is? For instance I just called it and got "America/Los_Angeles" back, while I was expecting to get "US/Pacific". Do/should we care about that distinction? Also FWIW, the way dateutil decides to return a tzlocal() object is based on looking at |
@jreback Yes, there's just a hard-to-reproduce conflict with 2.5.0-2.5.3. Not sure how serious it is at this point. @jbrockmendel Regarding accuracy, it's pulling from system files like There's no distinction between |
Although tzlocal hurts portability (unless that's the intension), I think pandas should ensure that timezones in == timezones out and shouldn't make any assumptions or transformation when returning the timezone (i.e tzlocal in, tzlocal out; US/Pacific in, US/Pacific out), which I can see your PR does @qwhelan . Overall agreed that I'd be great to address this regression. Shot in the dark: I think the proper way to compare dateutil tzs is with |
use |
64e8788
to
79bb80e
Compare
Hello @qwhelan! Thanks for updating the PR.
Comment last updated on February 05, 2019 at 07:53 Hours UTC |
3c66b5f
to
702b29a
Compare
@jreback I'm finally able to reproduce the |
9c695f7
to
6e37589
Compare
…tz.tzlocal operations
put in pandas/tseries/_tzlocal.py i think - see how that works |
@jreback How about |
sure that’s ok too |
Is there anything in particular that bears cythonizing? |
Haven't looked closely, but I imagine most all functions here can be |
@@ -17,7 +17,7 @@ cimport numpy as cnp | |||
from numpy cimport int64_t | |||
cnp.import_array() | |||
|
|||
|
|||
cimport pandas._libs.tslibs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cython was making noise about this when running under 2.7; looks like it doesn't raise the same error under 3. Will remove this.
@@ -3,6 +3,9 @@ | |||
cpdef bint is_utc(object tz) | |||
cdef bint is_tzlocal(object tz) | |||
|
|||
cpdef object get_tzlocal_tz(object tz) | |||
cpdef _set_tzlocal_tz(object tz) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this used externally? If so, it shouldn’t be private. If not, it shouldn’t be in the pxd file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's currently being used to assist in testing, where we cycle through all pytz
timezones and check equivalence against tzlocal
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so I wouldn't change anything in the vendored copy (its hard to tell if you did); I just saw a comment about 'cythonizing it'. Just drop it in place.
@@ -0,0 +1,5 @@ | |||
import sys |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't put this in src (which is not a real package, nor do we want it to be), rather just in pandas/_libs/tzlocal
is fine
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
alternative would be something like pandas/vendored/
. pd.io.clipboard
would also belong in such a directory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
call this pandas/_vendored/
(explicity private)
@@ -0,0 +1,5 @@ | |||
import sys |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you make it clear that this is a vendored copy
@@ -34,6 +37,25 @@ cdef inline bint is_tzlocal(object tz): | |||
return isinstance(tz, _dateutil_tzlocal) | |||
|
|||
|
|||
cpdef object get_tzlocal_tz(object tz): | |||
global tzlocal_tz |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this to prevent import issues?
@qwhelan can you merge master and address build failures? |
@@ -0,0 +1,5 @@ | |||
import sys |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
call this pandas/_vendored/
(explicity private)
can you merge master and update |
1 similar comment
can you merge master and update |
closing as stale. nice idea though, pls ping if you can continue. |
This is a follow-up to #24491, where it was noted that
tzlocal
operations are notably slower:This PR makes them indistinguishable from other timezones:
To do this, we add a new optional dependency:
tzlocal
. This package queries the OS to try and identify apytz
compatible timezone name for the local timezone. If this is successful, we can use thepytz
timezone instead ofdateutil.tz.tzlocal()
, yielding a huge speedup as the latter invokesdatetime.datetime()
. In cases where the local timezone is also UTC, further speedups are possible.In all cases, the
tzlocal()
object is maintained as the.tz
to preserve any semantic meaning.One unresolved issue is an incompatibility with
python-dateutil
versions prior to2.6.0
. I'm unable to reproduce the error locally, but can repeatably get this error on Travis:While this only shows up on the 2.7 build, I believe this is due to it being the only one actually building against
python-dateutil=2.5.0
. Any version>=2.6.0
does not exhibit this error.Finally, here is the
asv
benchmark againstv0.24.0rc1
:git diff upstream/master -u -- "*.py" | flake8 --diff