0.23.4 changed read_csv parsing for mixed-timezone datetimes #24987

cc @jbrockmendel, @mroeschke, @gfyoung, @swyoon if you have thoughts here.

Appears this was an intentional change, as described in the original issue: #22256 (comment). Also, AFAICT, the previous coercion-to-UTC behavior was not documented.

I don't see anything in the original issue #22256 (comment) about mixed timezones. So to clarify, what's the expected output of

    In [8]: pd.read_csv(io.StringIO('a\n2000-01-01T00:00:00+06:00\n2000-01-01T00:00:00+05:00'), parse_dates=['a']).a

In 0.23.4, that's

    Out[4]:
    0   1999-12-31 18:00:00
    1   1999-12-31 19:00:00
    Name: a, dtype: datetime64[ns]

and in 0.24.0 that's

    Out[8]:
    0    2000-01-01 00:00:00+06:00
    1    2000-01-01 00:00:00+05:00
    Name: a, dtype: object

If we want the 0.24.0 behavior, do we have an alternative recommendation for "parse this mess into a column of datetimes"? Something like

    pd.read_csv(..., date_parser=lambda x: pd.to_datetime(x, utc=True))

If that's the recommendation going forward, I can submit a PR updating the release note for 0.24 to note the breaking change, and an example with mixed timezones.

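For reference, here is a minimal, self-contained version of that workaround (a sketch assuming pandas 0.24; note that, unlike the 0.23.4 output, the result keeps an explicit UTC timezone rather than dropping the tzinfo):

    import io
    import pandas as pd

    data = 'a\n2000-01-01T00:00:00+06:00\n2000-01-01T00:00:00+05:00'

    # date_parser is handed the raw strings of the column; returning a
    # tz-aware, UTC DatetimeIndex yields a datetime64[ns, UTC] column.
    df = pd.read_csv(io.StringIO(data), parse_dates=['a'],
                     date_parser=lambda x: pd.to_datetime(x, utc=True))

    print(df['a'])
    # 0   1999-12-31 18:00:00+00:00
    # 1   1999-12-31 19:00:00+00:00
    # Name: a, dtype: datetime64[ns, UTC]
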
FWIW I see two upsides to not converting to UTC. First, conversion is lossy; mixed timezones are a pretty weird special case, and if they are intentional, we preserve them. Second, I'm pretty sure this is consistent with to_datetime's treatment of strings, which is a plus.

Actually, does

    date_columns = ['a']
    df = pd.read_csv(...)
    df[date_columns] = df[date_columns].apply(pd.to_datetime, utc=True)

work? That is, read it in as strings, and then convert later?

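Spelled out end to end, that workflow looks something like this (a sketch; the inline CSV is just the example data from earlier in the thread):

    import io
    import pandas as pd

    data = 'a\n2000-01-01T00:00:00+06:00\n2000-01-01T00:00:00+05:00'
    date_columns = ['a']

    # Without parse_dates, the column comes back as plain strings...
    df = pd.read_csv(io.StringIO(data))

    # ...and a single vectorized to_datetime call per column converts to UTC.
    df[date_columns] = df[date_columns].apply(pd.to_datetime, utc=True)

    print(df['a'].dtype)  # datetime64[ns, UTC]
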
I would distinguish mixed timezones in a storage format like CSV from our internal representation. IMO, it's important to support easily ingesting this kind of data.

It does appear to be consistent with to_datetime:

    In [6]: pd.to_datetime(['2000-01-01T00:00:00+06:00', '2000-01-01T00:00:00+05:00'])
    Out[6]: Index([2000-01-01 00:00:00+06:00, 2000-01-01 00:00:00+05:00], dtype='object')

    In [7]: pd.to_datetime(['2000-01-01T00:00:00+06:00', '2000-01-01T00:00:00+05:00'], utc=True)
    Out[7]: DatetimeIndex(['1999-12-31 18:00:00+00:00', '1999-12-31 19:00:00+00:00'], dtype='datetime64[ns, UTC]', freq=None)

Yeah, I'd recommend post-processing via apply (or to_datetime directly if it's just one column).

The to_datetime example above was recently fixed in 0.24.0 as well: http://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.24.0.html#parsing-datetime-strings-with-timezone-offsets

Thanks. I'll link to that.

I'm a little late to the discussion, but I do agree with the general sentiment here. I think a documentation change will suffice.

@TomAugspurger: The behavior of …

Sorry, too late to the discussion, but we've just updated to 0.24.2 from 0.23.4... It breaks a lot of our code, which had relied on reading back the 'UTC' version for years. Having a fixed offset, which is most likely wrong due to DST transitions, serves no purpose unless you are localising to a known timezone using an IANA timezone name.

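For anyone in the same position, one migration path is to parse to UTC first and then convert to the known IANA zone (a sketch; 'Europe/London' stands in for whatever zone the data actually belongs to):

    import pandas as pd

    s = pd.to_datetime(
        pd.Series(['2019-03-31T00:30:00+00:00', '2019-03-31T02:30:00+01:00']),
        utc=True,
    )

    # tz_convert maps the unambiguous UTC instants onto the IANA zone,
    # DST transitions included.
    local = s.dt.tz_convert('Europe/London')
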
My understanding is that ATM it gives back an object-dtype array of Timestamps. This was changed because the all-UTC version drops information. If a user has a mixed-timezone array, the default is to assume it is intentional.

From @TomAugspurger's example above, the …

But my point is that "mixed timezones", or rather mixed timezone offsets, are very common for any location with DST transitions: half of the year the offsets are going to be different from the other half, but that is normal. What it was doing before was OK, except that …

Also, an IANA timezone string is what I think of when we talk about timezones, as it is unambiguous; a timezone offset does not give me that information anyway, only how to convert to UTC. I could have two IANA timezones producing exactly the same offset strings, and hence timestamps, for half a year and be wrong the other half of the year, or have the DST transition on different days.

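The "same offsets for half the year" point is easy to demonstrate (a sketch using the stdlib zoneinfo module, so Python 3.9+; America/Phoenix does not observe DST while America/Denver does):

    from datetime import datetime, timezone
    from zoneinfo import ZoneInfo

    for month in (1, 7):
        t = datetime(2020, month, 1, 12, tzinfo=timezone.utc)
        for zone in ('America/Phoenix', 'America/Denver'):
            # utcoffset() is all that a bare offset string records.
            print(zone, month, t.astimezone(ZoneInfo(zone)).utcoffset())

    # In January both zones sit at UTC-07:00 and are indistinguishable from
    # offsets alone; in July Denver moves to UTC-06:00 while Phoenix stays put.
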
That's true, but coercion to UTC …

Now, straddling the DST transition point has historically been a tricky spot for datetime indices on our end, so making a unilateral decision on how to address it would have been unusual, to say the least. So from those standpoints, I stand by our decision to change the behavior as we did for …

I guess my issue is that I do not get a …

On a side note, there's an amazing difference in speed:

    buf = io.StringIO()
    pd.DataFrame(
        index=pd.date_range(
            start='2015-03-10T00:00',
            end='2020-03-12T00:00',
            tz='America/Havana',
            freq='1H'
        )
    ).to_csv(buf)

    %timeit buf.seek(0); pd.read_csv(buf, parse_dates=[0], infer_datetime_format=True)
    4.78 s ± 19.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    %timeit buf.seek(0); pd.read_csv(buf, parse_dates=[0], date_parser=lambda x: pd.to_datetime(x, utc=True))
    750 ms ± 11.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

so it seems the date_parser route is several times faster here. This difference is only present for mixed timezone offsets: when the timezone offset is a constant '00:00' I didn't observe any difference between the modes, and it is around …

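For completeness, the "read as strings, convert afterwards" route suggested earlier in the thread can be timed the same way (a sketch; I would expect it to land near the date_parser variant, since both funnel the strings through a single vectorized to_datetime call, but that is an assumption rather than a measured number):

    # Skip parse_dates entirely and convert the first column after the read.
    %timeit buf.seek(0); pd.to_datetime(pd.read_csv(buf).iloc[:, 0], utc=True)
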
Previously, a column in a CSV with mixed timezones would (I think) convert each value to UTC and discard the tzinfo. For the example above, on 0.23.4 that's

    0   1999-12-31 18:00:00
    1   1999-12-31 19:00:00
    Name: a, dtype: datetime64[ns]

On 0.24 that's

    0    2000-01-01 00:00:00+06:00
    1    2000-01-01 00:00:00+05:00
    Name: a, dtype: object

I'm not sure what the expected behavior is here, but I think the old behavior is as good as any.

I haven't verified, but #22380 seems like a likely candidate for introducing the change.