Skip to content

REGR: Index.union should pick correct dtype for combinations of tstamps with different tzones #28034

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
zogzog opened this issue Aug 20, 2019 · 15 comments
Labels
Index Related to the Index class or subclasses Timezones Timezone data dtype

Comments

@zogzog
Copy link

zogzog commented Aug 20, 2019

Here's a pdb session showing the issue:

ipdb> basei
DatetimeIndex(['2017-10-28 23:00:00+00:00', '2017-10-29 00:00:00+00:00',
               '2017-10-29 01:00:00+00:00', '2017-10-29 02:00:00+00:00'],
              dtype='datetime64[ns, UTC]', freq=None)
ipdb> diffi
DatetimeIndex(['2017-10-29 02:00:00+01:00', '2017-10-29 03:00:00+01:00',
               '2017-10-29 04:00:00+01:00', '2017-10-29 05:00:00+01:00'],
              dtype='datetime64[ns, Europe/Paris]', freq=None)
ipdb> basei.union(diffi)
Index([2017-10-28 23:00:00+00:00, 2017-10-29 00:00:00+00:00,
       2017-10-29 01:00:00+00:00, 2017-10-29 02:00:00+00:00,
       2017-10-29 04:00:00+01:00, 2017-10-29 05:00:00+01:00],
      dtype='object')

With pandas 0.24 I got back an index with a dtype='datetime64[ns, UTC]'

This looks a bit like a cousin of #26778

@TomAugspurger
Copy link
Contributor

I think that was a deliberate change: https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.25.0.html#incompatible-index-type-unions.

cc @ArtinSarraf if you recall whether mix of datetimetz was discussed.

@TomAugspurger TomAugspurger added Timezones Timezone data dtype Index Related to the Index class or subclasses labels Aug 20, 2019
@zogzog
Copy link
Author

zogzog commented Aug 20, 2019

I've seen the "incompatible types" discussion but I feel like this is going too far. Admittedly this is an edge case (like in the int vs float discussion).

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Aug 20, 2019 via email

@mroeschke
Copy link
Member

I agree that this is a deliberate change in the right direction as to prevent the loss of original timezone information akin to this change back in 0.24.0: https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.24.0.html#parsing-datetime-strings-with-timezone-offsets

You can always to_datetime(..., utc=True) to coerce your result to UTC.

@zogzog
Copy link
Author

zogzog commented Aug 21, 2019

Timezone information, as far as I am concerned, is a presentation information, a bit like the 'freq' attribute. Two timestamps in differing timezones are fundamentally compatible, and losing timezones is about dropping display information. Are there situations where this would be wrong ?

Another point: it already cost me an hour to write code to paper over the issue. I have a non-indecent test base, but real world tests reveal that I have non-handled edge cases. So it's not just a matter of .tz_converting the offending index and the cost of reimplementing myself what .union did previously is going to mount. In other words, this is a breaking change.

@jreback
Copy link
Contributor

jreback commented Aug 21, 2019

Timezone information, as far as I am concerned, is a presentation information, a bit like the 'freq' attribute. Two timestamps in differing timezones are fundamentally compatible, and losing timezones is about dropping display information. Are there situations where this would be wrong ?

this is not the case at all in pandas; time zones are first class; turning them to UTC is losing information; we have deliberately tried not to do this generally

mixing times zones while possible is often a bug in user code; it’s possible but very likely not something you want library code to do, rather it becomes a burden of the user code

Another point: it already cost me an hour to write code to paper over the issue. I have a non-indecent test base, but real world tests reveal that I have non-handled edge cases. So it's not just a matter of .tz_converting the offending index and the cost of reimplementing myself what .union did previously is going to mount. In other words, this is a breaking change.

not really sure what is your point here. pandas has lots of users; api changes are a fact of life

@ms7463
Copy link
Contributor

ms7463 commented Aug 22, 2019

@TomAugspurger - yes, this exact case was discussed explicitly.

#23538 (comment)

Also FWIW, I think that the fact alone, that converting from an dst aware tz for the Nov dst transition hour can be ambiguous, is enough to disqualify automatically converting to UTC, for mismatched tz merges.

@zogzog
Copy link
Author

zogzog commented Aug 22, 2019

Thanks for the reference. Maybe then a small deprecation period would have been gentler ?

@ArtinSarraf can you expand on your dst transition ambiguity example ? For my personal education I'd like to understand the issue (if only because I currently believe tz-aware stamps don't have ambiguity issues).

@ms7463
Copy link
Contributor

ms7463 commented Aug 22, 2019

Ignore my comment, what I was thinking of wasn't an issue with already tz-aware date ranges, but rather the ambiguity than can occur from localizing a non tz-aware date range.

@zogzog
Copy link
Author

zogzog commented Aug 26, 2019

The very fact tz-aware tstamps are non-abiguous is the reason I think this is a regression. The actual timezones are display information (just like freq). There is no fundamental information loss when the time zone is aligned to UTC.

@jreback
Copy link
Contributor

jreback commented Aug 26, 2019

@zogzog pls read my comments again. there is loss of information as pandas keeeps the tz meta data as part of the dtype; so by definition you are losing information.

@zogzog
Copy link
Author

zogzog commented Aug 26, 2019

I've read your comment. "By definition" doesn't cut it. Did you read mine and try to understand the argument and address it ?

@mroeschke
Copy link
Member

@zogzog this is what @jreback is referring to:

In [1]: utc = pd.Series(pd.date_range('2019', periods=3, freq='D', tz='UTC'))

In [2]: pacific = utc.dt.tz_convert('US/Pacific')

In [3]: utc
Out[3]:
0   2019-01-01 00:00:00+00:00
1   2019-01-02 00:00:00+00:00
2   2019-01-03 00:00:00+00:00
dtype: datetime64[ns, UTC]

In [4]: pacific
Out[4]:
0   2018-12-31 16:00:00-08:00
1   2019-01-01 16:00:00-08:00
2   2019-01-02 16:00:00-08:00
dtype: datetime64[ns, US/Pacific]

# What you are probably referring to
In [5]: utc == pacific
Out[5]:
0    True
1    True
2    True
dtype: bool

In [6]: utc.dtype
Out[6]: datetime64[ns, UTC]

In [7]: pacific.dtype
Out[7]: datetime64[ns, US/Pacific]

# The loss of information that would occur if Index.union coerced to UTC
In [8]: utc.dtype == pacific.dtype
Out[8]: False

@zogzog
Copy link
Author

zogzog commented Aug 26, 2019

@mroeschke thanks, but I had perfectly well understood. I stand by my position that this is a benign loss, in fact a complete non-issue. Apart from the principled position you take (which of course I'm not opposed to in abstract), I would love to know where in practice this would cause issues. Did the previous design cause real-world issues ? Do you have real-world data, like bug reports, indicating that it is wrong to switch the time zone ?

@mroeschke
Copy link
Member

I imagine any piece of logic that is trying to detect the original timezone and suddenly getting UTC would be impacted by automatic coercion to UTC.

As mentioned above, this was a deliberate API change and this is a move toward maintaining original data types. Going to close this issue as there's not much appetite to go the other direction

zogzog added a commit to pythonianfr/tshistory that referenced this issue Dec 29, 2022
see pandas-dev/pandas#28034

We switch every timezone-aware series into utc now.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Index Related to the Index class or subclasses Timezones Timezone data dtype
Projects
None yet
Development

No branches or pull requests

5 participants