0.23.4 changed read_csv parsing for mixed-timezone datetimes #24987

cc @jbrockmendel, @mroeschke, @gfyoung, @swyoon if you have thoughts here.

Appears this was an intentional change, as described in the original issue: #22256 (comment). Also, AFAICT, the previous coercion-to-UTC behavior was not documented.

I don't see anything in the original issue #22256 (comment) about mixed timezones. So to clarify, what's the expected output of

    In [8]: pd.read_csv(io.StringIO('a\n2000-01-01T00:00:00+06:00\n2000-01-01T00:00:00+05:00'), parse_dates=['a']).a

In 0.23.4, that's

    Out[4]:
    0   1999-12-31 18:00:00
    1   1999-12-31 19:00:00
    Name: a, dtype: datetime64[ns]

and in 0.24.0 that's

    Out[8]:
    0    2000-01-01 00:00:00+06:00
    1    2000-01-01 00:00:00+05:00
    Name: a, dtype: object

If we want the 0.24.0 behavior, do we have an alternative recommendation for "parse this mess into a column of datetimes"? Something like

    pd.read_csv(..., date_parser=lambda x: pd.to_datetime(x, utc=True))

If that's the recommendation going forward, I can submit a PR updating the release note for 0.24 to note the breaking change, and an example with mixed timezones.

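For reference, here is a minimal, self-contained version of that workaround (a sketch assuming pandas 0.24; note that, unlike the 0.23.4 output, the result keeps an explicit UTC timezone rather than dropping the tzinfo):

    import io
    import pandas as pd

    data = 'a\n2000-01-01T00:00:00+06:00\n2000-01-01T00:00:00+05:00'

    # date_parser is handed the raw strings of the column; returning a
    # tz-aware, UTC DatetimeIndex yields a datetime64[ns, UTC] column.
    df = pd.read_csv(io.StringIO(data), parse_dates=['a'],
                     date_parser=lambda x: pd.to_datetime(x, utc=True))

    print(df['a'])
    # 0   1999-12-31 18:00:00+00:00
    # 1   1999-12-31 19:00:00+00:00
    # Name: a, dtype: datetime64[ns, UTC]
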
FWIW I see two upsides to not converting to UTC. First, conversion is lossy; mixed timezones are a pretty weird special case, and if they are intentional, we preserve them. Second, I'm pretty sure this is consistent with to_datetime's treatment of strings, which is a plus.

Actually, does

    date_columns = ['a']
    df = pd.read_csv(...)
    df[date_columns] = df[date_columns].apply(pd.to_datetime, utc=True)

work? That is, read it in as strings, and then convert later?

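Spelled out end to end, that workflow looks something like this (a sketch; the inline CSV is just the example data from earlier in the thread):

    import io
    import pandas as pd

    data = 'a\n2000-01-01T00:00:00+06:00\n2000-01-01T00:00:00+05:00'
    date_columns = ['a']

    # Without parse_dates, the column comes back as plain strings...
    df = pd.read_csv(io.StringIO(data))

    # ...and a single vectorized to_datetime call per column converts to UTC.
    df[date_columns] = df[date_columns].apply(pd.to_datetime, utc=True)

    print(df['a'].dtype)  # datetime64[ns, UTC]
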
I would distinguish mixed timezones in a storage format like CSV from our internal representation. IMO, it's important to support easily ingesting this kind of data.

It does appear to be consistent with to_datetime:

    In [6]: pd.to_datetime(['2000-01-01T00:00:00+06:00', '2000-01-01T00:00:00+05:00'])
    Out[6]: Index([2000-01-01 00:00:00+06:00, 2000-01-01 00:00:00+05:00], dtype='object')

    In [7]: pd.to_datetime(['2000-01-01T00:00:00+06:00', '2000-01-01T00:00:00+05:00'], utc=True)
    Out[7]: DatetimeIndex(['1999-12-31 18:00:00+00:00', '1999-12-31 19:00:00+00:00'], dtype='datetime64[ns, UTC]', freq=None)

Yeah, I'd recommend post-processing via apply (or to_datetime directly if it's just one column).

The to_datetime example above was recently fixed in 0.24.0 as well: http://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.24.0.html#parsing-datetime-strings-with-timezone-offsets

Thanks. I'll link to that.

I'm a little late to the discussion, but I do agree with the general sentiment here. I think a documentation change will suffice.

@TomAugspurger: The behavior of …

Sorry, too late to the discussion, but we've just updated to 0.24.2 from 0.23.4... It breaks a lot of our code, which had relied on reading back the 'UTC' version for years. Having a fixed offset, which is most likely wrong due to DST transitions, serves no purpose unless you are localising to a known timezone using an IANA timezone name.

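For anyone in the same position, one migration path is to parse to UTC first and then convert to the known IANA zone (a sketch; 'Europe/London' stands in for whatever zone the data actually belongs to):

    import pandas as pd

    s = pd.to_datetime(
        pd.Series(['2019-03-31T00:30:00+00:00', '2019-03-31T02:30:00+01:00']),
        utc=True,
    )

    # tz_convert maps the unambiguous UTC instants onto the IANA zone,
    # DST transitions included.
    local = s.dt.tz_convert('Europe/London')
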
My understanding is that ATM it gives back an object-dtype array of Timestamps. This was changed because the all-UTC version drops information. If a user has a mixed-timezone array, the default is to assume it is intentional.

From @TomAugspurger's example above, the …

But my point is that "mixed timezones", or rather mixed timezone offsets, are very common for any location with DST transitions: half of the year the offsets are going to be different from the other half, but that is normal. What it was doing before was OK, except that …

Also, an IANA timezone string is what I think of when we talk about timezones, as it is unambiguous; a timezone offset does not give me that information anyway, only how to convert to UTC. I could have two IANA timezones producing exactly the same offset strings, and hence timestamps, for half a year and be wrong the other half of the year, or have the DST transition on different days.

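The "same offsets for half the year" point is easy to demonstrate (a sketch using the stdlib zoneinfo module, so Python 3.9+; America/Phoenix does not observe DST while America/Denver does):

    from datetime import datetime, timezone
    from zoneinfo import ZoneInfo

    for month in (1, 7):
        t = datetime(2020, month, 1, 12, tzinfo=timezone.utc)
        for zone in ('America/Phoenix', 'America/Denver'):
            # utcoffset() is all that a bare offset string records.
            print(zone, month, t.astimezone(ZoneInfo(zone)).utcoffset())

    # In January both zones sit at UTC-07:00 and are indistinguishable from
    # offsets alone; in July Denver moves to UTC-06:00 while Phoenix stays put.
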
That's true, but coercion to UTC …

Now, straddling the DST transition point has historically been a tricky spot for datetime indices on our end, so making a unilateral decision on how to address it would have been unusual, to say the least. So from those standpoints, I stand by our decision to change the behavior as we did for …

I guess my issue is that I do not get a …

On a side note, there's an amazing difference in speed:

    buf = io.StringIO()
    pd.DataFrame(
        index=pd.date_range(
            start='2015-03-10T00:00',
            end='2020-03-12T00:00',
            tz='America/Havana',
            freq='1H'
        )
    ).to_csv(buf)

    %timeit buf.seek(0); pd.read_csv(buf, parse_dates=[0], infer_datetime_format=True)
    4.78 s ± 19.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

    %timeit buf.seek(0); pd.read_csv(buf, parse_dates=[0], date_parser=lambda x: pd.to_datetime(x, utc=True))
    750 ms ± 11.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

so it seems the date_parser route is several times faster here. This difference is only present for mixed timezone offsets: when the timezone offset is a constant '00:00' I didn't observe any difference between the modes, and it is around …

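For completeness, the "read as strings, convert afterwards" route suggested earlier in the thread can be timed the same way (a sketch; I would expect it to land near the date_parser variant, since both funnel the strings through a single vectorized to_datetime call, but that is an assumption rather than a measured number):

    # Skip parse_dates entirely and convert the first column after the read.
    %timeit buf.seek(0); pd.to_datetime(pd.read_csv(buf).iloc[:, 0], utc=True)
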
Previously, a column in a CSV with mixed timezones would (I think) convert each value to UTC and discard the tzinfo. For the example above, on 0.23.4 that's

    0   1999-12-31 18:00:00
    1   1999-12-31 19:00:00
    Name: a, dtype: datetime64[ns]

On 0.24 that's

    0    2000-01-01 00:00:00+06:00
    1    2000-01-01 00:00:00+05:00
    Name: a, dtype: object

I'm not sure what the expected behavior is here, but I think the old behavior is as good as any.

I haven't verified, but #22380 seems like a likely candidate for introducing the change.