Skip to content

tz_compare fails to consider string "UTC" and pytz UTC object equal #23959

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
TomAugspurger opened this issue Nov 27, 2018 · 9 comments · Fixed by #39216
Closed

tz_compare fails to consider string "UTC" and pytz UTC object equal #23959

TomAugspurger opened this issue Nov 27, 2018 · 9 comments · Fixed by #39216
Labels
Timezones Timezone data dtype
Milestone

Comments

@TomAugspurger
Copy link
Contributor

Nothing user-facing here, but this surprised me.

In [5]: pd._libs.tslibs.timezones.tz_compare("UTC", pytz.timezone("UTC"))
Out[5]: False

Other timezones behave as expected

In [6]: pd._libs.tslibs.timezones.tz_compare("US/Eastern", pytz.timezone("US/Eastern"))
Out[6]: True

Is this user error, or a bug?

@TomAugspurger TomAugspurger added the Timezones Timezone data dtype label Nov 27, 2018
@mroeschke
Copy link
Member

This slipped through the cracks when I did some UTC refactoring. We don't rely on "UTC" as a string internally anymore but this may as well work since "US/Eastern". I think this is not getting cast by pytz. I'll put up a PR tonight.

@mroeschke
Copy link
Member

QQ: Would we expect tz_compare("UTC", dateutil.tz.UTC) to return True? How about tz_compare(pytz.timezone("UTC"), dateutil.tz.UTC)?

@TomAugspurger
Copy link
Contributor Author

I'm not sure, but it probably depends on exactly how we're using tz_compare. I was thinking of it as "are these two timezones the same?" which is a fuzzy concept, but in this case I would say that they're the same.

The docs say we compare "string representations", and they're clearly different

In [10]: str(dateutil.tz.UTC)
Out[10]: 'tzutc()'

In [11]: str(pytz.timezone("UTC"))
Out[11]: 'UTC'

cc @pganssle if you have thoughts.

@pganssle
Copy link
Contributor

pganssle commented Dec 3, 2018

Sorry, I have somewhat let this fall by the wayside.

Unfortunately, I don't know of any reasonably generic way to compare time zones, and the semantics of timezone aware datetime comparison unfortunately use is rather than __eq__.

The reason for tz_compare is because pandas is working around pytz's workaround for a lack of PEP 495. pytz works by assigning fixed offsets when the tzinfo is attached, which means that you have this:

NYC = pytz.timezone('America/New_York')
dt1 = NYC.localize(datetime(2018, 1, 1))
dt2 = NYC.localize(datetime(2018, 6, 1))
print(dt1.tzinfo is dt2.tzinfo)
# False

I am not aware of whether pytz defines __eq__ for its time zones, but I think there is some reason you cannot just check direct equality of pytz time zones. In any case, the reason why tz_compare exists is that str(dt.tzinfo) for a pytz zone will always return the IANA key for IANA zones independent of which offset is applied, so this is a proxy for for what you really want, which is "Does this offset represent the same time zone".

In any case, you may be able to special-case some logic related to UTC, but unfortunately the semantics are already messed up and you may in fact be creating more weird edge cases and inconsistencies by trying to patch in compatibility between time zone providers, or to try to support string-to-timezone comparisons directly. Hopefully we will be making the str representations of dateutil time zones nicer at some time in the future, but I wouldn't count on it.

@pganssle
Copy link
Contributor

pganssle commented Dec 3, 2018

By the way, my solution to this that should unravel most of the complicated stuff you are doing around time zones is to create your own IANA time zone implementation. The two representations you are relying on are not designed for your use case, and as a result you make ample use of private methods, which is a much higher maintenance burden than just creating your own time zone implementation.

I think you should somewhat reliably be able to create a mapping between input time zones and the IANA zones they map to, so the extent of your support for third party time zones would be to immediately convert them to an internal pandas TZ representation. This would allow you to drop pytz as a dependency as well.

@jreback
Copy link
Contributor

jreback commented Dec 3, 2018

@pganssle

to the extent of your support for third party time zones would be to immediately convert them to an internal pandas TZ representation. This would allow you to drop pytz as a dependency as well.

what would this look like?

@pganssle
Copy link
Contributor

pganssle commented Dec 3, 2018

what would this look like?

Taking some time to sketch out some sort of prototype is on my list, but unfortunately it's not high on my list. I think the high level overview is:

  1. I resolve Move zoneinfo data to a separate package dateutil/dateutil#701, so that the dateutil zoneinfo file can be a separate dependency of pandas (I think pendulum has a package that's just the tz data, too, but I don't know if it's suitable for general use).
  2. pandas adds a parser for zoneinfo files (this is not terribly difficult to do, especially in Cython
  3. pandas adds a class equivalent to tzinfo (doesn't necessarily have to be tzinfo) that expects to operate on datetime array types, and a function mapping IANA keys to the relevant PandasTzInfo

For strings, the support is built in already. For pytz you can already get the IANA key from any pytz zone by calling str on it, for dateutil it's not possible yet, but that will change soon (PR welcome).

There will be time zones that cannot be translated in this way, but I don't see that this is a big deal, because this is "fast path" time zone handling code. You can have two levels of support. UTC, fixed offset time zones and IANA zones represented as strings, pytz or dateutil types get translated automatically into the internal types when applied to a TZ-aware column. Everything else goes down the "slow path", which is the equivalent of having an object column - time zone handling code just calls the underlying datetime object's time zone functions in a for loop.

If done right, this doesn't actually require a dependency on pytz, because you can either use heuristics to detect pytz zones (or other clever methods to avoid unnecessary imports of pytz).

Most of this does not have to be a breaking change, though dropping pytz will likely need to be a breaking change. That is not a huge part of it, and depending on how strongly you feel about a notice period, I have a few strategies you can employ that would maintain API compatibility but slowly transition over to dateutil.

@jbrockmendel
Copy link
Member

This is partially fixed as tz_compare now requires tzinfo objects, so the string "UTC" is irrelevant. It's still questionable that tz_compare(pytz.UTC, timezone.utc) gives False.

@andrewcooke
Copy link

questionable? it just wasted yet another day in my life.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Timezones Timezone data dtype
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants