-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
API/BUG: Enforce "normalized" pytz timezones for DatetimeIndex #20510
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Remnant of pandas-devgh-20249.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks good! some comments
doc/source/whatsnew/v0.23.0.txt
Outdated
@@ -916,6 +918,7 @@ Timezones | |||
- Bug in :func:`DataFrame.diff` that raised an ``IndexError`` with tz-aware values (:issue:`18578`) | |||
- Bug in :func:`melt` that converted tz-aware dtypes to tz-naive (:issue:`15785`) | |||
- Bug in :func:`Dataframe.count` that raised an ``ValueError`` if .dropna() method is invoked for single column timezone-aware values. (:issue:`13407`) | |||
- Bug in :func:`Dataframe.resample` that dropped timezone information (:issue:`13238`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you move to resample section
pandas/_libs/tslibs/timezones.pyx
Outdated
@@ -314,3 +314,21 @@ cpdef bint tz_compare(object start, object end): | |||
""" | |||
# GH 18523 | |||
return get_timezone(start) == get_timezone(end) | |||
|
|||
|
|||
cpdef tz_normalize(object tz): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not a great name, normalize has a specific meaning to timestamps. maybe tz_standardize
?
pandas/_libs/tslibs/timezones.pyx
Outdated
version | ||
|
||
Parameters | ||
---------- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add some examples
dti.tz = pytz.timezone('US/Pacific') | ||
|
||
@pytest.mark.parametrize('tz', [ | ||
None, 'America/Los_Angeles', pytz.timezone('America/Los_Angeles'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a similar battery of these tests to Timestamp as well (disabled setting already is disabled, but see if we have a test for it).
@@ -1005,7 +1005,7 @@ def shift(self, n, freq=None): | |||
result = self + offset | |||
|
|||
if hasattr(self, 'tz'): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I generally avoid hasattr
in projects that support Python 2.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah this impl is shared by DTI and TDI so this could be restructured a bit. Please file an issue .
I'm not sure I understand the purpose of making it so you can't set It's also not certain that you will get |
Timestamp and DTI are immutable objects, so setting after creation doesn't make sense. |
this is generally true. The actual tz is an implementation detail, which we don't really care / want to expose to the user directly (of course we want them to be able to construct using there favorite tz library & ISO). Users mostly just care about the iso repr anyhow (US/Eastern). After this PR things will more closely follow this pattern. |
@jreback Looking more closely, I think I was slightly confused because the comments and documentation are misleading. None of these are testing or ensuring that the tz is "LMT", they are just ensuring that it is a consistent |
# GH 18595 | ||
non_norm_tz = Timestamp('2010', tz=tz).tz | ||
result = DatetimeIndex(['2010'], tz=non_norm_tz) | ||
assert pytz.timezone(tz) == result.tz |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Timezone comparison should use is
, not ==
. See this post about aware datetime comparison (or this one about aware datetime semantics in general).
@pganssle Yeah I was mainly using |
@mroeschke I don't think "the first tzinfo object" is really the contract either. The point is that That said, exposing that information anywhere in the public-facing documentation is probably a bad idea, because it's very much a I think a property alone is enough to hide the implementation detail that the time zone object may be different for different |
Codecov Report
@@ Coverage Diff @@
## master #20510 +/- ##
==========================================
- Coverage 91.84% 91.82% -0.03%
==========================================
Files 153 153
Lines 49256 49257 +1
==========================================
- Hits 45241 45230 -11
- Misses 4015 4027 +12
Continue to review full report at Codecov.
|
pandas/_libs/tslibs/timezones.pyx
Outdated
@@ -316,10 +316,10 @@ cpdef bint tz_compare(object start, object end): | |||
return get_timezone(start) == get_timezone(end) | |||
|
|||
|
|||
cpdef tz_normalize(object tz): | |||
cpdef tz_standardize(object tz): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this need to be a cpdef
? Can it just be cdef
? I don't know why end users would need to be able to do this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is called within a python file (pandas/core/indexes/datetimelike.py
), so it can't just be a cdef
Addressed your comments @jreback. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small changes.
@@ -250,7 +250,7 @@ def test_set_index_cast_datetimeindex(self): | |||
df['C'] = i.to_series().reset_index(drop=True) | |||
result = df['C'] | |||
comp = pd.DatetimeIndex(expected.values).copy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can remove the .copy() here
note that ALL functions really should have documentation. private ones as well as public ones. We have been making efforts on this recently. It is just useful for future readers. |
Sure, but the same could be said for |
pandas/_libs/tslibs/timestamps.pyx
Outdated
@@ -700,6 +700,12 @@ class Timestamp(_Timestamp): | |||
""" | |||
return self.tzinfo | |||
|
|||
@tz.setter | |||
def tz(self, value): | |||
# GH 3746 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you put here a more informative comment instead of referring to the issue (if it is needed of course)?
I think the only reason to have this is to have a more informative error message?
@@ -684,6 +678,17 @@ def _values(self): | |||
else: | |||
return self.values | |||
|
|||
@property | |||
def tz(self): | |||
# GH 18595 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same here
that's not possible in this case, as we need to use
Confusion solved :) My starting point was that only That said,
I suppose this would only be solved by no longer returning a pytz instance? This could potentially break peoples codes that use a pytz-specific method of the |
This is not quite true. In cdef tz_standardize(tz):
...
cpdef get_tz(self):
return tz_standardize(self._tz) And then you can just do: class DateTimeIndex:
tz = property(get_tz) This is one way to do it (might actually be faster because presumably the
It doesn't actually return a Deprecating that behavior would be tricky, though. The best option for deprecation I think would be to create some sort of mock object that wraps |
@jorisvandenbossche address your comment and all green. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. can u run a quick asv for this
should be ok but doing a bit more work when returning tz
I'll try to get those ASV sometime tomorrow; my asv setup still doesn't work and need to look into it. |
I wasn't able to resolve my asv issues, but here's a timeit in the meantime. There appears to be a noticeable slowdown.
|
@mroeschke Have you tried using the mechanism I mentioned in this comment, using That said, it seems unlikely that anyone is accessing the |
Actually, just tried it, a lot of the "slow" part is converting to the "canonical" Using a # master branch
In [4]: %timeit dti.tz
69.8 ns ± 0.826 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
# cdef tz_standardize
#master
In [8]: %timeit dti.tz
69.5 ns ± 1.25 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
In [3]: %timeit dti.tz
2.41 µs ± 11.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
# cpdef tz_standardize
In [6]: %timeit dti.tz
2.79 µs ± 113 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) And for a # cdef tz_standardize
In [25]: %timeit dti.tz
478 ns ± 31.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
# cpdef tz_standardize
In [4]: %timeit dti.tz
691 ns ± 4.62 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each) Another 190 ns is explained by using a property at all (and thus using the descriptor protocol). When In [4]: %timeit dti.tz
190 ns ± 1.11 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each) I think the remaining timing is just the time it takes to determine if something is "pytz-like". That said, there's not much you can do about that, particularly since most people will be using |
Thanks for the investigation @pganssle! I am getting similar results for the
For a dateutil tz, this should just some overhead with a But overall I agree with you that I do not suspect many accessing |
It's more than just that, it's also the function call overhead. To the extent that |
so what u need to do here is use the cache_only decorator |
pandas/core/indexes/datetimes.py
Outdated
@@ -684,6 +678,11 @@ def _values(self): | |||
else: | |||
return self.values | |||
|
|||
@cache_readonly | |||
def tz(self): | |||
# GH 18595 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i believe we now have this in setter / getter versions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, the only caching decorator I could find was implemented here:
pandas/pandas/_libs/properties.pyx
Lines 9 to 44 in c7af4ae
cdef class CachedProperty(object): | |
cdef readonly: | |
object func, name, __doc__ | |
def __init__(self, func): | |
self.func = func | |
self.name = func.__name__ | |
self.__doc__ = getattr(func, '__doc__', None) | |
def __get__(self, obj, typ): | |
if obj is None: | |
# accessed on the class, not the instance | |
return self | |
# Get the cache or set a default one if needed | |
cache = getattr(obj, '_cache', None) | |
if cache is None: | |
try: | |
cache = obj._cache = {} | |
except (AttributeError): | |
return self | |
if PyDict_Contains(cache, self.name): | |
# not necessary to Py_INCREF | |
val = <object> PyDict_GetItem(cache, self.name) | |
else: | |
val = self.func(obj) | |
PyDict_SetItem(cache, self.name, val) | |
return val | |
def __set__(self, obj, value): | |
raise AttributeError("Can't set attribute") | |
cache_readonly = CachedProperty |
will this PR still work if instead of calling |
Good call @jreback. Additionally, standardizing All green |
very nice @mroeschke ! keep em coming! |
closes #3746
closes #18595
closes #13238
git diff upstream/master -u -- "*.py" | flake8 --diff
Addressing 3 birds with 2 stones here. Using @pganssle suggested implementation for a tz property and directly raising an error per #3746 (could depreciate and error in a future version as well, open to feedback on the prefered path of API change)
Additionally, solves the issue of resampling a DataFrame/Series with a DatetimeIndex that retained a local timezone instead of the "LMT" version.