Skip to content

API/BUG: Enforce "normalized" pytz timezones for DatetimeIndex #20510

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 24 commits into from
Apr 11, 2018

Conversation

mroeschke
Copy link
Member

@mroeschke mroeschke commented Mar 28, 2018

closes #3746
closes #18595
closes #13238

  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry

Addressing 3 birds with 2 stones here. Using @pganssle suggested implementation for a tz property and directly raising an error per #3746 (could depreciate and error in a future version as well, open to feedback on the prefered path of API change)

Additionally, solves the issue of resampling a DataFrame/Series with a DatetimeIndex that retained a local timezone instead of the "LMT" version.

@jreback jreback added Bug Timezones Timezone data dtype labels Mar 28, 2018
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good! some comments

@@ -916,6 +918,7 @@ Timezones
- Bug in :func:`DataFrame.diff` that raised an ``IndexError`` with tz-aware values (:issue:`18578`)
- Bug in :func:`melt` that converted tz-aware dtypes to tz-naive (:issue:`15785`)
- Bug in :func:`Dataframe.count` that raised an ``ValueError`` if .dropna() method is invoked for single column timezone-aware values. (:issue:`13407`)
- Bug in :func:`Dataframe.resample` that dropped timezone information (:issue:`13238`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you move to resample section

@@ -314,3 +314,21 @@ cpdef bint tz_compare(object start, object end):
"""
# GH 18523
return get_timezone(start) == get_timezone(end)


cpdef tz_normalize(object tz):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not a great name, normalize has a specific meaning to timestamps. maybe tz_standardize?

version

Parameters
----------
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add some examples

dti.tz = pytz.timezone('US/Pacific')

@pytest.mark.parametrize('tz', [
None, 'America/Los_Angeles', pytz.timezone('America/Los_Angeles'),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a similar battery of these tests to Timestamp as well (disabled setting already is disabled, but see if we have a test for it).

@jreback jreback added the Resample resample method label Mar 28, 2018
@@ -1005,7 +1005,7 @@ def shift(self, n, freq=None):
result = self + offset

if hasattr(self, 'tz'):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I generally avoid hasattr in projects that support Python 2.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah this impl is shared by DTI and TDI so this could be restructured a bit. Please file an issue .

@pganssle
Copy link
Contributor

I'm not sure I understand the purpose of making it so you can't set tz.

It's also not certain that you will get LMT for any given zone (though it is likely for certain zones). In general I think people should not use pytz at all, but I'm aware that that might be a pretty big API change. Still, I think any sort of datetime implementation should try to hide pytz's implementation details, not expose them directly.

@jreback
Copy link
Contributor

jreback commented Mar 28, 2018

I'm not sure I understand the purpose of making it so you can't set tz.

Timestamp and DTI are immutable objects, so setting after creation doesn't make sense.

@jreback
Copy link
Contributor

jreback commented Mar 28, 2018

It's also not certain that you will get LMT for any given zone (though it is likely for certain zones). In general I think people should not use pytz at all, but I'm aware that that might be a pretty big API change. Still, I think any sort of datetime implementation should try to hide pytz's implementation details, not expose them directly.

this is generally true. The actual tz is an implementation detail, which we don't really care / want to expose to the user directly (of course we want them to be able to construct using there favorite tz library & ISO). Users mostly just care about the iso repr anyhow (US/Eastern). After this PR things will more closely follow this pattern.

@pganssle
Copy link
Contributor

@jreback Looking more closely, I think I was slightly confused because the comments and documentation are misleading. None of these are testing or ensuring that the tz is "LMT", they are just ensuring that it is a consistent tzinfo object. I think it is best to scrub all mentions of LMT from the comments and documentation.

# GH 18595
non_norm_tz = Timestamp('2010', tz=tz).tz
result = DatetimeIndex(['2010'], tz=non_norm_tz)
assert pytz.timezone(tz) == result.tz
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mroeschke
Copy link
Member Author

@pganssle Yeah I was mainly using LMT as an alias (and consistent pytz basis) for "the first tzinfo object returned from pytz.timezone(...)" since I thought it would always be LMT. Since that may not be the case, agreed that I should remove mention of LMT to avoid confusion

@pganssle
Copy link
Contributor

pganssle commented Mar 28, 2018

@mroeschke I don't think "the first tzinfo object" is really the contract either. The point is that tz should represent the canonical time zone object, to the extent that there is one (any one of them can be chosen as a representative, so long as it's always the same one).

That said, exposing that information anywhere in the public-facing documentation is probably a bad idea, because it's very much a pytz concept, and specifically one of the weird ones that violates the tzinfo interface. I think it would be better if pandas started moving away from any overt sign that they use pytz and try to move away from using pytz-specific concepts.

I think a property alone is enough to hide the implementation detail that the time zone object may be different for different datetime objects within the DateTimeIndex.

@codecov
Copy link

codecov bot commented Mar 29, 2018

Codecov Report

Merging #20510 into master will decrease coverage by 0.02%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #20510      +/-   ##
==========================================
- Coverage   91.84%   91.82%   -0.03%     
==========================================
  Files         153      153              
  Lines       49256    49257       +1     
==========================================
- Hits        45241    45230      -11     
- Misses       4015     4027      +12
Flag Coverage Δ
#multiple 90.21% <100%> (-0.03%) ⬇️
#single 41.91% <63.63%> (ø) ⬆️
Impacted Files Coverage Δ
pandas/core/indexes/datetimelike.py 96.72% <100%> (ø) ⬆️
pandas/core/indexes/datetimes.py 95.73% <100%> (ø) ⬆️
pandas/plotting/_converter.py 65.07% <0%> (-1.74%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update aa3fefc...67a29d5. Read the comment docs.

@@ -316,10 +316,10 @@ cpdef bint tz_compare(object start, object end):
return get_timezone(start) == get_timezone(end)


cpdef tz_normalize(object tz):
cpdef tz_standardize(object tz):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be a cpdef? Can it just be cdef? I don't know why end users would need to be able to do this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function is called within a python file (pandas/core/indexes/datetimelike.py), so it can't just be a cdef

@mroeschke
Copy link
Member Author

mroeschke commented Mar 30, 2018

Addressed your comments @jreback.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small changes.

@@ -250,7 +250,7 @@ def test_set_index_cast_datetimeindex(self):
df['C'] = i.to_series().reset_index(drop=True)
result = df['C']
comp = pd.DatetimeIndex(expected.values).copy()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can remove the .copy() here

@jreback
Copy link
Contributor

jreback commented Mar 31, 2018

note that ALL functions really should have documentation. private ones as well as public ones. We have been making efforts on this recently. It is just useful for future readers.

@pganssle
Copy link
Contributor

Sure, but the same could be said for dateutil.parser._timelex, the undocumented private class not in __all__, but you were the one who persuaded me to add a deprecation warning to that. I'm suggesting that these things go much easier when the interpreter literally won't let you access the symbols (though this is not always possible).

@@ -700,6 +700,12 @@ class Timestamp(_Timestamp):
"""
return self.tzinfo

@tz.setter
def tz(self, value):
# GH 3746
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you put here a more informative comment instead of referring to the issue (if it is needed of course)?
I think the only reason to have this is to have a more informative error message?

@@ -684,6 +678,17 @@ def _values(self):
else:
return self.values

@property
def tz(self):
# GH 18595
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same here

@jorisvandenbossche
Copy link
Member

things go much easier when the interpreter literally won't let you access the symbols (though this is not always possible)

that's not possible in this case, as we need to use tz_standardize (internally) in a python file

Not sure what part you are confused about, but my intention was that tz_standardize should be as private as possible, not part of the public interface. Having time zone accessed exclusively through a tz property should be sufficient to achieve this

Confusion solved :) My starting point was that only .tz was already the only public exposure.

That said, .tz is public, and returns a pytz instance:

I think it would be better if pandas started moving away from any overt sign that they use pytz and try to move away from using pytz-specific concepts.

I suppose this would only be solved by no longer returning a pytz instance? This could potentially break peoples codes that use a pytz-specific method of the tz object?

@pganssle
Copy link
Contributor

pganssle commented Apr 1, 2018

that's not possible in this case, as we need to use tz_standardize (internally) in a python file

This is not quite true. In tslibs you can do this:

cdef tz_standardize(tz):
    ...

cpdef get_tz(self):
    return tz_standardize(self._tz)

And then you can just do:

class DateTimeIndex:
    tz = property(get_tz)

This is one way to do it (might actually be faster because presumably the tz_standarize call is inlined into get_tz). There are other mechanisms that prevent exposing the tz_standardize symbol, but I'm not terribly worried about it honestly.

I suppose this would only be solved by no longer returning a pytz instance? This could potentially break peoples codes that use a pytz-specific method of the tz object?

It doesn't actually return a pytz instance, it returns whatever tzinfo object you attached. If you use a dateutil timezone, it should return a dateutil timezone. If you're going to move away from pytz, I would change it so that passing an IANA timezone string to tz_localize attaches a dateutil zone, not a pytz zone, but as you mention that would probably be a breaking change.

Deprecating that behavior would be tricky, though. The best option for deprecation I think would be to create some sort of mock object that wraps pytz time zones and raises DeprecationWarning whenever any of the pytz-specific methods are called.

@mroeschke
Copy link
Member Author

@jorisvandenbossche address your comment and all green.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. can u run a quick asv for this
should be ok but doing a bit more work when returning tz

@mroeschke
Copy link
Member Author

I'll try to get those ASV sometime tomorrow; my asv setup still doesn't work and need to look into it.

@mroeschke
Copy link
Member Author

I wasn't able to resolve my asv issues, but here's a timeit in the meantime. There appears to be a noticeable slowdown.

In [1]: import pandas as pd

In [2]: dti = pd.DatetimeIndex([pd.Timestamp('2010', tz='US/Pacific')])

# This branch
In [3]: %timeit dti.tz
2.21 µs ± 25.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

#Master
In [4]: %timeit dti.tz
52.8 ns ± 1.37 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

@pganssle
Copy link
Contributor

pganssle commented Apr 4, 2018

@mroeschke Have you tried using the mechanism I mentioned in this comment, using tz_standardize as a cdef? That may end up being faster because it saves on Python function call overhead, but I'm not sure how much faster.

That said, it seems unlikely that anyone is accessing the tz property in a tight loop. It's an operation that happens once per index. I suspect spending too much time optimizing it might be a waste of time.

@pganssle
Copy link
Contributor

pganssle commented Apr 4, 2018

Actually, just tried it, a lot of the "slow" part is converting to the "canonical" pytz zone. It's much faster if you tz refers to a dateutil time zone.

Using a pytz zone:

# master branch
In [4]: %timeit dti.tz
69.8 ns ± 0.826 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

# cdef tz_standardize
#master
In [8]: %timeit dti.tz
69.5 ns ± 1.25 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [3]: %timeit dti.tz
2.41 µs ± 11.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

# cpdef tz_standardize
In [6]: %timeit dti.tz
2.79 µs ± 113 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

And for a dateutil zone:

# cdef tz_standardize
In [25]: %timeit dti.tz
478 ns ± 31.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

# cpdef tz_standardize
In [4]: %timeit dti.tz
691 ns ± 4.62 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Another 190 ns is explained by using a property at all (and thus using the descriptor protocol). When tz is set to return self._tz, we get:

In [4]: %timeit dti.tz
190 ns ± 1.11 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

I think the remaining timing is just the time it takes to determine if something is "pytz-like". That said, there's not much you can do about that, particularly since most people will be using pytz until and unless pandas starts to move towards using dateutil zones.

@mroeschke
Copy link
Member Author

Thanks for the investigation @pganssle! I am getting similar results for the get_tz property implementation as the one on this branch.

In [1]: import pandas as pd

In [2]:  dti = pd.DatetimeIndex([pd.Timestamp('2010', tz='US/Pacific')])

# tz = property(get_tz)
In [3]: %timeit dti.tz
2.43 µs ± 153 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

For a dateutil tz, this should just some overhead with a hasattr check, but you're probably correct that for pytz, a good amount of time is spent constructing the new pytz.

But overall I agree with you that I do not suspect many accessing .tz in a tight loop.

@pganssle
Copy link
Contributor

pganssle commented Apr 4, 2018

For a dateutil tz, this should just some overhead with a hasattr check, but you're probably correct that for pytz, a good amount of time is spent constructing the new pytz.

It's more than just that, it's also the function call overhead. To the extent that %timeit is reliable, inlining the extra call to tz_standardize does provide a reasonable (looks like it could be about 30%) speedup when using dateutil zones.

@jreback
Copy link
Contributor

jreback commented Apr 4, 2018

so what u need to do here is use the cache_only decorator
since these are immutable once constructed they r good

@@ -684,6 +678,11 @@ def _values(self):
else:
return self.values

@cache_readonly
def tz(self):
# GH 18595
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i believe we now have this in setter / getter versions

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, the only caching decorator I could find was implemented here:

cdef class CachedProperty(object):
cdef readonly:
object func, name, __doc__
def __init__(self, func):
self.func = func
self.name = func.__name__
self.__doc__ = getattr(func, '__doc__', None)
def __get__(self, obj, typ):
if obj is None:
# accessed on the class, not the instance
return self
# Get the cache or set a default one if needed
cache = getattr(obj, '_cache', None)
if cache is None:
try:
cache = obj._cache = {}
except (AttributeError):
return self
if PyDict_Contains(cache, self.name):
# not necessary to Py_INCREF
val = <object> PyDict_GetItem(cache, self.name)
else:
val = self.func(obj)
PyDict_SetItem(cache, self.name, val)
return val
def __set__(self, obj, value):
raise AttributeError("Can't set attribute")
cache_readonly = CachedProperty

@jreback
Copy link
Contributor

jreback commented Apr 5, 2018

will this PR still work if instead of calling tz_standarize on the .tz, you instead set the ._tz with the standardized value? this sidesteps the caching as its not longer needed then.

@mroeschke
Copy link
Member Author

mroeschke commented Apr 6, 2018

Good call @jreback. Additionally, standardizing ._tz directly allows us to provide a better error message when someone attempts to set .tz.

All green

@jreback jreback added this to the 0.23.0 milestone Apr 11, 2018
@jreback jreback merged commit fa24af9 into pandas-dev:master Apr 11, 2018
@jreback
Copy link
Contributor

jreback commented Apr 11, 2018

very nice @mroeschke ! keep em coming!

@mroeschke mroeschke deleted the resample_timezone branch April 11, 2018 17:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Resample resample method Timezones Timezone data dtype
Projects
None yet
6 participants