API/BUG: Enforce "normalized" pytz timezones for DatetimeIndex #20510

mroeschke · 2018-03-28T03:19:52Z

closes #3746
closes #18595
closes #13238

tests added / passed
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

Addressing 3 birds with 2 stones here. Using @pganssle's suggested implementation for a tz property and directly raising an error per #3746 (could deprecate and error in a future version as well; open to feedback on the preferred path of API change).

Additionally, solves the issue of resampling a DataFrame/Series with a DatetimeIndex that retained a local timezone instead of the "LMT" version.

Remnant of pandas-devgh-20249.

jreback

looks good! some comments

jreback · 2018-03-28T10:15:06Z

doc/source/whatsnew/v0.23.0.txt

@@ -916,6 +918,7 @@ Timezones
 - Bug in :func:`DataFrame.diff` that raised an ``IndexError`` with tz-aware values (:issue:`18578`)
 - Bug in :func:`melt` that converted tz-aware dtypes to tz-naive (:issue:`15785`)
 - Bug in :func:`Dataframe.count` that raised an ``ValueError`` if .dropna() method is invoked for single column timezone-aware values. (:issue:`13407`)
+- Bug in :func:`Dataframe.resample` that dropped timezone information (:issue:`13238`)


can you move to resample section

jreback · 2018-03-28T10:19:23Z

pandas/_libs/tslibs/timezones.pyx

@@ -314,3 +314,21 @@ cpdef bint tz_compare(object start, object end):
    """
    # GH 18523
    return get_timezone(start) == get_timezone(end)
+
+
+cpdef tz_normalize(object tz):


this is not a great name, normalize has a specific meaning to timestamps. maybe tz_standardize?

jreback · 2018-03-28T10:19:38Z

pandas/_libs/tslibs/timezones.pyx

+    version
+
+    Parameters
+    ----------


can you add some examples

jreback · 2018-03-28T10:21:19Z

pandas/tests/indexes/datetimes/test_construction.py

+            dti.tz = pytz.timezone('US/Pacific')
+
+    @pytest.mark.parametrize('tz', [
+        None, 'America/Los_Angeles', pytz.timezone('America/Los_Angeles'),


can you add a similar battery of these tests to Timestamp as well (setting is already disabled there, but see if we have a test for it).

pganssle · 2018-03-28T11:29:33Z

pandas/core/indexes/datetimelike.py

@@ -1005,7 +1005,7 @@ def shift(self, n, freq=None):
            result = self + offset

            if hasattr(self, 'tz'):


I generally avoid hasattr in projects that support Python 2.

yeah this impl is shared by DTI and TDI so this could be restructured a bit. Please file an issue.

pganssle · 2018-03-28T11:34:42Z

I'm not sure I understand the purpose of making it so you can't set tz.

It's also not certain that you will get LMT for any given zone (though it is likely for certain zones). In general I think people should not use pytz at all, but I'm aware that that might be a pretty big API change. Still, I think any sort of datetime implementation should try to hide pytz's implementation details, not expose them directly.

jreback · 2018-03-28T11:54:46Z

I'm not sure I understand the purpose of making it so you can't set tz.

Timestamp and DTI are immutable objects, so setting after creation doesn't make sense.

jreback · 2018-03-28T11:56:22Z

It's also not certain that you will get LMT for any given zone (though it is likely for certain zones). In general I think people should not use pytz at all, but I'm aware that that might be a pretty big API change. Still, I think any sort of datetime implementation should try to hide pytz's implementation details, not expose them directly.

this is generally true. The actual tz is an implementation detail, which we don't really care / want to expose to the user directly (of course we want them to be able to construct using their favorite tz library & ISO). Users mostly just care about the iso repr anyhow (US/Eastern). After this PR things will more closely follow this pattern.

pganssle · 2018-03-28T12:27:02Z

@jreback Looking more closely, I think I was slightly confused because the comments and documentation are misleading. None of these are testing or ensuring that the tz is "LMT", they are just ensuring that it is a consistent tzinfo object. I think it is best to scrub all mentions of LMT from the comments and documentation.

pganssle · 2018-03-28T12:28:49Z

pandas/tests/indexes/datetimes/test_construction.py

+        # GH 18595
+        non_norm_tz = Timestamp('2010', tz=tz).tz
+        result = DatetimeIndex(['2010'], tz=non_norm_tz)
+        assert pytz.timezone(tz) == result.tz


Timezone comparison should use is, not ==. See this post about aware datetime comparison (or this one about aware datetime semantics in general).
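A small standard-library sketch of the difference between the two comparisons. It uses `datetime.timezone` instead of pytz, which is an assumption made only to keep the example dependency-free: two tzinfo objects can compare equal without being the same object, so `is` is the stricter check.

```python
from datetime import timedelta, timezone

# Two separately constructed fixed-offset zones with the same offset.
a = timezone(timedelta(hours=-8))
b = timezone(timedelta(hours=-8))

equal = a == b        # compares the offsets
identical = a is b    # compares object identity

print(f"a == b: {equal}, a is b: {identical}")
```

A test that asserts with `==` would therefore still pass if the index handed back a different but equal-comparing object, which is the case an identity check is meant to catch.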

mroeschke · 2018-03-28T16:55:32Z

@pganssle Yeah I was mainly using LMT as an alias (and consistent pytz basis) for "the first tzinfo object returned from pytz.timezone(...)" since I thought it would always be LMT. Since that may not be the case, agreed that I should remove mention of LMT to avoid confusion

pganssle · 2018-03-28T19:28:01Z

@mroeschke I don't think "the first tzinfo object" is really the contract either. The point is that tz should represent the canonical time zone object, to the extent that there is one (any one of them can be chosen as a representative, so long as it's always the same one).

That said, exposing that information anywhere in the public-facing documentation is probably a bad idea, because it's very much a pytz concept, and specifically one of the weird ones that violates the tzinfo interface. I think it would be better if pandas started moving away from any overt sign that they use pytz and try to move away from using pytz-specific concepts.

I think a property alone is enough to hide the implementation detail that the time zone object may be different for different datetime objects within the DatetimeIndex.
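To make the "canonical representative" idea concrete, here is a minimal sketch with stand-in objects. `Zone`, `_CANONICAL`, `standardize` and `Index` are hypothetical names and do not reflect how pytz or pandas are implemented. Several distinct objects may describe the same zone, and the property always hands back one fixed representative.

```python
class Zone:
    """Hypothetical tzinfo-like object; several instances may share a name."""

    def __init__(self, name, variant):
        self.name = name
        self.variant = variant  # stands in for per-period offset details


# One fixed representative per zone name. Which instance is chosen is
# arbitrary; what matters is that it is always the same one.
_CANONICAL = {}


def standardize(zone):
    """Return the registered representative for zone.name."""
    return _CANONICAL.setdefault(zone.name, zone)


class Index:
    """Hypothetical container that hides which variant it was built with."""

    def __init__(self, zone):
        self._zone = zone

    @property
    def tz(self):
        return standardize(self._zone)


first = Index(Zone("US/Pacific", "variant-a"))
second = Index(Zone("US/Pacific", "variant-b"))
```

With this in place `first.tz is second.tz` holds even though the two containers were built from different objects, and callers never see the variants.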

codecov · 2018-03-29T05:58:20Z

Codecov Report

Merging #20510 into master will decrease coverage by 0.02%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #20510      +/-   ##
==========================================
- Coverage   91.84%   91.82%   -0.03%     
==========================================
  Files         153      153              
  Lines       49256    49257       +1     
==========================================
- Hits        45241    45230      -11     
- Misses       4015     4027      +12

Flag	Coverage Δ
#multiple	`90.21% <100%> (-0.03%)`	⬇️
#single	`41.91% <63.63%> (ø)`	⬆️

Impacted Files	Coverage Δ
pandas/core/indexes/datetimelike.py	`96.72% <100%> (ø)`	⬆️
pandas/core/indexes/datetimes.py	`95.73% <100%> (ø)`	⬆️
pandas/plotting/_converter.py	`65.07% <0%> (-1.74%)`	⬇️

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update aa3fefc...67a29d5. Read the comment docs.

pganssle · 2018-03-29T12:56:37Z

pandas/_libs/tslibs/timezones.pyx

@@ -316,10 +316,10 @@ cpdef bint tz_compare(object start, object end):
    return get_timezone(start) == get_timezone(end)


-cpdef tz_normalize(object tz):
+cpdef tz_standardize(object tz):


Does this need to be a cpdef? Can it just be cdef? I don't know why end users would need to be able to do this.

This function is called within a python file (pandas/core/indexes/datetimelike.py), so it can't just be a cdef

mroeschke · 2018-03-30T02:18:09Z

Addressed your comments @jreback.

jreback

small changes.

jreback · 2018-03-30T19:31:10Z

pandas/tests/frame/test_alter_axes.py

@@ -250,7 +250,7 @@ def test_set_index_cast_datetimeindex(self):
        df['C'] = i.to_series().reset_index(drop=True)
        result = df['C']
        comp = pd.DatetimeIndex(expected.values).copy()


can remove the .copy() here

jreback · 2018-03-31T16:50:55Z

note that ALL functions really should have documentation. private ones as well as public ones. We have been making efforts on this recently. It is just useful for future readers.

pganssle · 2018-03-31T16:57:13Z

Sure, but the same could be said for dateutil.parser._timelex, the undocumented private class not in __all__, but you were the one who persuaded me to add a deprecation warning to that. I'm suggesting that these things go much easier when the interpreter literally won't let you access the symbols (though this is not always possible).

jorisvandenbossche · 2018-04-01T12:18:42Z

pandas/_libs/tslibs/timestamps.pyx

@@ -700,6 +700,12 @@ class Timestamp(_Timestamp):
        """
        return self.tzinfo

+    @tz.setter
+    def tz(self, value):
+        # GH 3746


Can you put here a more informative comment instead of referring to the issue (if it is needed of course)?
I think the only reason to have this is to have a more informative error message?

jorisvandenbossche · 2018-04-01T12:19:39Z

pandas/core/indexes/datetimes.py

@@ -684,6 +678,17 @@ def _values(self):
        else:
            return self.values

+    @property
+    def tz(self):
+        # GH 18595


jorisvandenbossche · 2018-04-01T12:39:51Z

things go much easier when the interpreter literally won't let you access the symbols (though this is not always possible)

that's not possible in this case, as we need to use tz_standardize (internally) in a python file

Not sure what part you are confused about, but my intention was that tz_standardize should be as private as possible, not part of the public interface. Having time zone accessed exclusively through a tz property should be sufficient to achieve this

Confusion solved :) My starting point was that .tz was already the only public exposure.

That said, .tz is public, and returns a pytz instance:

I think it would be better if pandas started moving away from any overt sign that they use pytz and try to move away from using pytz-specific concepts.

I suppose this would only be solved by no longer returning a pytz instance? This could potentially break people's code that uses a pytz-specific method of the tz object?

pganssle · 2018-04-01T17:24:00Z

that's not possible in this case, as we need to use tz_standardize (internally) in a python file

This is not quite true. In tslibs you can do this:

cdef tz_standardize(tz):
    ...

cpdef get_tz(self):
    return tz_standardize(self._tz)

And then you can just do:

class DatetimeIndex:
    tz = property(get_tz)

This is one way to do it (might actually be faster because presumably the tz_standardize call is inlined into get_tz). There are other mechanisms that prevent exposing the tz_standardize symbol, but I'm not terribly worried about it honestly.

I suppose this would only be solved by no longer returning a pytz instance? This could potentially break people's code that uses a pytz-specific method of the tz object?

It doesn't actually return a pytz instance, it returns whatever tzinfo object you attached. If you use a dateutil timezone, it should return a dateutil timezone. If you're going to move away from pytz, I would change it so that passing an IANA timezone string to tz_localize attaches a dateutil zone, not a pytz zone, but as you mention that would probably be a breaking change.

Deprecating that behavior would be tricky, though. The best option for deprecation I think would be to create some sort of mock object that wraps pytz time zones and raises DeprecationWarning whenever any of the pytz-specific methods are called.

mroeschke · 2018-04-02T16:42:51Z

@jorisvandenbossche addressed your comment and all green.

jreback

lgtm. can u run a quick asv for this
should be ok but doing a bit more work when returning tz

mroeschke · 2018-04-03T06:15:25Z

I'll try to get those ASV results sometime tomorrow; my asv setup still doesn't work and I need to look into it.

mroeschke · 2018-04-04T06:30:38Z

I wasn't able to resolve my asv issues, but here's a timeit in the meantime. There appears to be a noticeable slowdown.

In [1]: import pandas as pd

In [2]: dti = pd.DatetimeIndex([pd.Timestamp('2010', tz='US/Pacific')])

# This branch
In [3]: %timeit dti.tz
2.21 µs ± 25.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

#Master
In [4]: %timeit dti.tz
52.8 ns ± 1.37 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

pganssle · 2018-04-04T13:40:46Z

@mroeschke Have you tried using the mechanism I mentioned in this comment, using tz_standardize as a cdef? That may end up being faster because it saves on Python function call overhead, but I'm not sure how much faster.

That said, it seems unlikely that anyone is accessing the tz property in a tight loop. It's an operation that happens once per index. I suspect spending too much time optimizing it might be a waste of time.

pganssle · 2018-04-04T15:03:07Z

Actually, just tried it, a lot of the "slow" part is converting to the "canonical" pytz zone. It's much faster if your tz refers to a dateutil time zone.

Using a pytz zone:

# master branch
In [4]: %timeit dti.tz
69.8 ns ± 0.826 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [8]: %timeit dti.tz
69.5 ns ± 1.25 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

# cdef tz_standardize
In [3]: %timeit dti.tz
2.41 µs ± 11.2 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

# cpdef tz_standardize
In [6]: %timeit dti.tz
2.79 µs ± 113 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

And for a dateutil zone:

# cdef tz_standardize
In [25]: %timeit dti.tz
478 ns ± 31.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

# cpdef tz_standardize
In [4]: %timeit dti.tz
691 ns ± 4.62 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Another 190 ns is explained by using a property at all (and thus using the descriptor protocol). When tz is set to return self._tz, we get:

In [4]: %timeit dti.tz
190 ns ± 1.11 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

I think the remaining timing is just the time it takes to determine if something is "pytz-like". That said, there's not much you can do about that, particularly since most people will be using pytz until and unless pandas starts to move towards using dateutil zones.
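The descriptor-protocol overhead mentioned above can be observed with a small `timeit` sketch on toy classes. `PlainAttr` and `WithProperty` are hypothetical names, and the absolute numbers depend on the machine and interpreter, so they will not match the figures quoted in this thread.

```python
import timeit


class PlainAttr:
    def __init__(self):
        self.tz = "US/Pacific"


class WithProperty:
    def __init__(self):
        self._tz = "US/Pacific"

    @property
    def tz(self):
        return self._tz


plain = PlainAttr()
prop = WithProperty()

# Total seconds for 100,000 attribute reads of each kind.
t_plain = timeit.timeit(lambda: plain.tz, number=100_000)
t_prop = timeit.timeit(lambda: prop.tz, number=100_000)
print(f"plain attribute: {t_plain:.4f}s, property: {t_prop:.4f}s")
```

Both timings include the cost of calling the lambda, so only the difference between them is meaningful.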

mroeschke · 2018-04-04T16:34:48Z

Thanks for the investigation @pganssle! I am getting similar results for the get_tz property implementation as the one on this branch.

In [1]: import pandas as pd

In [2]:  dti = pd.DatetimeIndex([pd.Timestamp('2010', tz='US/Pacific')])

# tz = property(get_tz)
In [3]: %timeit dti.tz
2.43 µs ± 153 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

For a dateutil tz, this should just be some overhead from a hasattr check, but you're probably correct that for pytz, a good amount of time is spent constructing the new pytz.

But overall I agree with you that I do not suspect many accessing .tz in a tight loop.

pganssle · 2018-04-04T16:49:40Z

For a dateutil tz, this should just be some overhead from a hasattr check, but you're probably correct that for pytz, a good amount of time is spent constructing the new pytz.

It's more than just that, it's also the function call overhead. To the extent that %timeit is reliable, inlining the extra call to tz_standardize does provide a reasonable (looks like it could be about 30%) speedup when using dateutil zones.

jreback · 2018-04-04T17:55:27Z

so what u need to do here is use the cache_readonly decorator
since these are immutable once constructed they r good

jreback · 2018-04-04T20:01:31Z

pandas/core/indexes/datetimes.py

@@ -684,6 +678,11 @@ def _values(self):
        else:
            return self.values

+    @cache_readonly
+    def tz(self):
+        # GH 18595


i believe we now have this in setter / getter versions

Hmm, the only caching decorator I could find was implemented here:

pandas/pandas/_libs/properties.pyx

Lines 9 to 44 in c7af4ae

cdef class CachedProperty(object):

    cdef readonly:
        object func, name, __doc__

    def __init__(self, func):
        self.func = func
        self.name = func.__name__
        self.__doc__ = getattr(func, '__doc__', None)

    def __get__(self, obj, typ):
        if obj is None:
            # accessed on the class, not the instance
            return self

        # Get the cache or set a default one if needed
        cache = getattr(obj, '_cache', None)
        if cache is None:
            try:
                cache = obj._cache = {}
            except (AttributeError):
                return self

        if PyDict_Contains(cache, self.name):
            # not necessary to Py_INCREF
            val = <object> PyDict_GetItem(cache, self.name)
        else:
            val = self.func(obj)
            PyDict_SetItem(cache, self.name, val)
        return val

    def __set__(self, obj, value):
        raise AttributeError("Can't set attribute")


cache_readonly = CachedProperty
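For readers less familiar with Cython, the same caching idea can be sketched in pure Python. This is an illustrative approximation, not the pandas implementation; `CachedReadonly` and `Example` are hypothetical names.

```python
class CachedReadonly:
    """Descriptor that computes a value once per instance and then
    serves it from the instance's _cache dict."""

    def __init__(self, func):
        self.func = func
        self.name = func.__name__
        self.__doc__ = getattr(func, "__doc__", None)

    def __get__(self, obj, typ=None):
        if obj is None:
            # accessed on the class, not the instance
            return self
        # Get the per-instance cache, creating it on first use.
        cache = obj.__dict__.setdefault("_cache", {})
        if self.name not in cache:
            cache[self.name] = self.func(obj)
        return cache[self.name]

    def __set__(self, obj, value):
        raise AttributeError("Can't set attribute")


class Example:
    calls = 0  # counts how often the wrapped function actually runs

    @CachedReadonly
    def tz(self):
        Example.calls += 1
        return "US/Pacific"


ex = Example()
first, second = ex.tz, ex.tz
```

The wrapped function runs only on the first access; later reads come from the cache, which is safe only because the owning object is treated as immutable.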

jreback · 2018-04-05T15:30:06Z

will this PR still work if instead of calling tz_standardize on the .tz, you instead set the ._tz with the standardized value? this sidesteps the caching as it's no longer needed then.

mroeschke · 2018-04-06T05:26:19Z

Good call @jreback. Additionally, standardizing ._tz directly allows us to provide a better error message when someone attempts to set .tz.

All green
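The approach settled on here, standardizing once when the object is built instead of on every `.tz` access, can be sketched as follows. `_standardize` is a hypothetical placeholder for the real standardization step, and `EagerIndex` and its error message are illustrative, not pandas code.

```python
def _standardize(tz):
    """Hypothetical placeholder: map a tz spelling to one canonical form."""
    return tz.strip() if isinstance(tz, str) else tz


class EagerIndex:
    def __init__(self, tz=None):
        # Pay the standardization cost once, at construction time.
        self._tz = _standardize(tz)

    @property
    def tz(self):
        # Plain stored value; no per-access work and no cache needed.
        return self._tz

    @tz.setter
    def tz(self, value):
        # An explicit setter allows a more helpful message than the default.
        raise AttributeError(
            "Cannot directly set timezone; create a new object instead"
        )


idx = EagerIndex(tz="  US/Pacific ")
```

Because the stored value is already standardized, the getter stays close to the cost of a bare property read, and the cache from the previous iteration becomes unnecessary.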

jreback · 2018-04-11T02:30:02Z

very nice @mroeschke ! keep em coming!

noemielteto and others added 11 commits March 20, 2018 18:38

DOC: update the Index.isin docstring (pandas-dev#20249)

b878540

MAINT: Remove weird pd file

0b0fb83

Remnant of pandas-devgh-20249.

BUG: Retain timezone when resampling

8316501

Normalize pytz timezone for Datetimeindexes

0bb1b13

Merge remote-tracking branch 'upstream/master' into resample_timezone

4d6f0d1

Change construction of tz

4a202b0

Merge remote-tracking branch 'upstream/master' into resample_timezone

29560c4

lint and add whatsnew

87cacf8

Merge remote-tracking branch 'upstream/master' into resample_timezone

e8990fc

Adjust whatsnew and add additional test

b1f2724

Adjust test

c1241f9

jreback added Bug Timezones Timezone data dtype labels Mar 28, 2018

jreback requested changes Mar 28, 2018

jreback added the Resample resample method label Mar 28, 2018

pganssle reviewed Mar 28, 2018

mroeschke added 2 commits March 28, 2018 21:10

Address review and failing CI test

9833f01

Merge remote-tracking branch 'upstream/master' into resample_timezone

43fab89

pganssle reviewed Mar 29, 2018

Merge remote-tracking branch 'upstream/master' into resample_timezone

f1a5ca7

jreback requested changes Mar 30, 2018

mroeschke added 2 commits March 31, 2018 12:26

Merge remote-tracking branch 'upstream/master' into resample_timezone

464a91b

add same tz error to timestamp

c1db598

jorisvandenbossche reviewed Apr 1, 2018

mroeschke added 2 commits April 1, 2018 22:21

Merge remote-tracking branch 'upstream/master' into resample_timezone

bf1ec9e

Add description of issue

81ccb21

jreback requested changes Apr 2, 2018

Merge remote-tracking branch 'upstream/master' into resample_timezone

12f697b

Merge remote-tracking branch 'upstream/master' into resample_timezone

867ef19

Use cache_readonly

360c295

jreback reviewed Apr 4, 2018

standardize ._tz directly

67a29d5

jreback added this to the 0.23.0 milestone Apr 11, 2018

jreback approved these changes Apr 11, 2018

jreback merged commit fa24af9 into pandas-dev:master Apr 11, 2018

mroeschke deleted the resample_timezone branch April 11, 2018 17:10
