Creating Series from DatetimeIndex w/ tz loses tz info #6032

Closed
cancan101 opened this issue Jan 22, 2014 · 29 comments · Fixed by #6398
Labels
API Design Datetime Datetime data dtype Timezones Timezone data dtype

@cancan101
Contributor

The original DatetimeIndex has the tz:

In [1]: pd.DatetimeIndex(
   ...:     pd.tseries.tools.to_datetime(['2013-1-1 13:00'], errors="raise")).tz_localize('US/Pacific')[0]
Out[1]: Timestamp('2013-01-01 13:00:00-0800', tz='US/Pacific')

But when converted to a Series, the tz info is lost:

In [2]: pd.Series(pd.DatetimeIndex(
   ...:     pd.tseries.tools.to_datetime(['2013-1-1 13:00'], errors="raise")).tz_localize('US/Pacific'))[0]
Out[2]: Timestamp('2013-01-01 21:00:00', tz=None)

If this is by design, then a warning would be nice.

This issue came up when I tried to add a list of timezone aware dates to a DataFrame.

@jreback
Contributor

jreback commented Feb 18, 2014

I am not sure what is correct here; using to_series() you get the index preserved (with TZ) on the resultant Series, with the values in UTC.

In [39]: pd.DatetimeIndex(pd.tseries.tools.to_datetime(['2013-1-1 13:00'], errors="raise")).tz_localize('US/Pacific').to_series().index
Out[39]: 
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 13:00:00-08:00]
Length: 1, Freq: None, Timezone: US/Pacific

So I think this is correct

@cancan101 makes sense?

@jorisvandenbossche
Member

I think @cancan101 means that the timezone is lost in the value in the Series, not in the index. So:

In [4]: pd.to_datetime(['2013-1-1 13:00'], errors="raise") \
               .tz_localize('US/Pacific').to_series()[0]
Out[4]: Timestamp('2013-01-01 21:00:00', tz=None)

And I think this is by design (or by 'limitation of numpy'). When the datetime values are in a column or series, and not in the index, they are just np.datetime64 instances:

In [5]: pd.to_datetime(['2013-1-1 13:00'], errors="raise") \
             .tz_localize('US/Pacific').to_series().values
Out[5]: array(['2013-01-01T22:00:00.000000000+0100'], dtype='datetime64[ns]')

And numpy has no real timezone support. You can see that the value is correct (I mean, the offset of the timezone is applied), but numpy prints it in local time.
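
To illustrate the point about the stored value, here is a minimal standard-library sketch. It assumes a fixed UTC-08:00 offset as a stand-in for US/Pacific in January (the `pst` name is illustrative and not from the thread). It shows that 13:00 at -08:00 and the naive 21:00 are the same instant once the offset has been applied.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC-08:00 offset stands in for US/Pacific in January (PST).
pst = timezone(timedelta(hours=-8), "PST")

aware = datetime(2013, 1, 1, 13, 0, tzinfo=pst)

# Normalize to UTC, then drop the tzinfo. This mirrors what a tz-naive
# value holds once the timezone offset has been applied.
as_utc = aware.astimezone(timezone.utc)
naive_utc = as_utc.replace(tzinfo=None)

print(aware.isoformat())      # 2013-01-01T13:00:00-08:00
print(naive_utc.isoformat())  # 2013-01-01T21:00:00
```

The instant is unchanged; only the label saying how to display it is gone.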

@cancan101
Contributor Author

The problem actually seems to be much simpler (and maybe easier to solve). I would think that both of these statements should return the same thing but they do not:

In [5]: i = pd.DatetimeIndex(pd.tseries.tools.to_datetime(['2013-1-1 13:00'], errors="raise")).tz_localize('US/Pacific')

In [6]: i.to_series()

Out[6]:
2013-01-01 13:00:00-08:00   2013-01-01 21:00:00
dtype: datetime64[ns]

In [7]: pd.Series(i)

Out[7]:
0   2013-01-01 21:00:00
dtype: datetime64[ns]

@jreback
Contributor

jreback commented Feb 18, 2014

@jorisvandenbossche well, it could be preserved, but then it would automatically be of object dtype. I think this should only be done very explicitly, i.e. when the user is very aware of what they are doing.

@jreback
Contributor

jreback commented Feb 18, 2014

ok...so it's the to_series() ....why don't you modify that and see what breaks (there may not actually be any tests for it)

@jorisvandenbossche
Member

@cancan101 Ah, so you were talking about the index. Misunderstood

@cancan101
Contributor Author

Well not totally. Maybe it would also be good to have an option to pass to to_series to preserve timezone (and change the type to object).

@cancan101
Contributor Author

Something like:

In [6]: i.to_series(keep_tz=True)

Out[6]:
2013-01-01 13:00:00-08:00   2013-01-01 13:00:00-08:00
dtype: object

@jorisvandenbossche
Member

But it should be?

In [6]: i.to_series(keep_tz=True)

Out[6]:
0   2013-01-01 13:00:00-08:00
dtype: object

EDIT: no, you are right. This seems documented behaviour. So pd.Series(idx) and idx.to_series() will give different results (http://pandas.pydata.org/pandas-docs/dev/generated/pandas.Index.to_series.html#pandas.Index.to_series)

@cancan101
Contributor Author

Not really sure about that. IMO, either way there seem to be some consistency issues, given that i.to_series() and Series(i) return results w/ and w/o the original index, respectively.

@jreback
Contributor

jreback commented Feb 18, 2014

Series does a lot more inference, and the tz is NOT currently passed for the values. The question is what is the correct behavior?

Should both index & values be the same (for a DatetimeIndex)?

jreback added the API Design label and removed the Bug label on Feb 18, 2014

@cancan101
Contributor Author

The issue for me wasn't actually an explicit call to the Series constructor. Rather what I did was something like this:

df["b"] = pd.DatetimeIndex(pd.tseries.tools.to_datetime(['2013-1-1 13:00'], errors="raise")).tz_localize('US/Pacific')

in which case the timezone info gets silently stripped from the new column I have added to the DataFrame.
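
One way to check what a given pandas installation does on assignment is to inspect the resulting column. This is only a sketch: it uses `pd.to_datetime` rather than the `pd.tseries.tools` path quoted in the thread, the frame and column names are illustrative, and the outcome depends on the installed pandas version, so no expected output is shown.

```python
import pandas as pd

# Build a tz-aware DatetimeIndex like the one in the report.
idx = pd.to_datetime(["2013-01-01 13:00"]).tz_localize("US/Pacific")

# Illustrative frame; assign the index as a new column.
df = pd.DataFrame({"a": [1.0]})
df["b"] = idx

first = df["b"].iloc[0]

# tzinfo is None if the timezone was stripped during assignment,
# in which case the value shows the UTC wall time (21:00).
print(df["b"].dtype)
print(first, first.tzinfo)
```

If `first.tzinfo` prints `None` and the hour is 21, the column was coerced to naive UTC as described above.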

@jreback
Contributor

jreback commented Feb 18, 2014

So this is probably a bug then: a DatetimeIndex that has a timezone is getting coerced to datetime64[ns], but should instead remain as object.

In [1]: s = pd.DatetimeIndex(pd.tseries.tools.to_datetime(['2013-1-1 13:00'], errors="raise")).tz_localize('US/Pacific')

In [2]: s
Out[2]: 
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 13:00:00-08:00]
Length: 1, Freq: None, Timezone: US/Pacific

In [3]: df = DataFrame(np.random.randn(1,1),columns=['A'])

In [4]: df['B'] = s

In [5]: df
Out[5]: 
         A                   B
0 -0.25968 2013-01-01 21:00:00

[1 rows x 2 columns]

In [6]: df.dtypes
Out[6]: 
A           float64
B    datetime64[ns]
dtype: object

In [7]: df['B'] = s.astype(object)

In [8]: df
Out[8]: 
         A                          B
0 -0.25968  2013-01-01 13:00:00-08:00

[1 rows x 2 columns]

In [9]: df.dtypes
Out[9]: 
A    float64
B     object
dtype: object

@jreback
Contributor

jreback commented Feb 18, 2014

PR is #6398

In [1]: i = pd.DatetimeIndex(pd.tseries.tools.to_datetime(['2013-1-1 13:00','2013-1-2 14:00'], errors="raise")).tz_localize('US/Pacific')

In [2]: i
Out[2]: 
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 13:00:00-08:00, 2013-01-02 14:00:00-08:00]
Length: 2, Freq: None, Timezone: US/Pacific

In [3]: df = DataFrame(np.random.randn(2,1),columns=['A'])

In [4]: Series(i)
Out[4]: 
0    2013-01-01 13:00:00-08:00
1    2013-01-02 14:00:00-08:00
dtype: object

In [5]: df['B'] = i

In [6]: df
Out[6]: 
          A                          B
0 -1.028888  2013-01-01 13:00:00-08:00
1  0.157796  2013-01-02 14:00:00-08:00

[2 rows x 2 columns]

In [7]: df.dtypes
Out[7]: 
A    float64
B     object
dtype: object

In [8]: i.to_series(keep_tz=True)
Out[8]: 
2013-01-01 13:00:00-08:00    2013-01-01 13:00:00-08:00
2013-01-02 14:00:00-08:00    2013-01-02 14:00:00-08:00
dtype: object

In [9]: df['C'] = i.to_series().reset_index(drop=True)

In [10]: df
Out[10]: 
          A                          B                   C
0 -1.028888  2013-01-01 13:00:00-08:00 2013-01-01 21:00:00
1  0.157796  2013-01-02 14:00:00-08:00 2013-01-02 22:00:00

[2 rows x 3 columns]

In [11]: df.dtypes
Out[11]: 
A           float64
B            object
C    datetime64[ns]
dtype: object

In [12]: Series(i).astype('datetime64[ns]')
Out[12]: 
0   2013-01-01 21:00:00
1   2013-01-02 22:00:00
dtype: datetime64[ns]

@jorisvandenbossche
Member

I personally don't see this as a bug, more as an oddity that you should be warned about, so I don't like the change. I would expect my DatetimeIndex to be converted to a datetime64 column, as arrays of datetime.datetime objects are, and would be surprised if this were not the case (but maybe I am just brainwashed by the way it is now ...).

@jorisvandenbossche
Member

Actually, arrays with datetime.datetime objects that have a time zone are also converted to object ...

@cancan101
Contributor Author

It looks like the underlying numpy array should support timezones: http://docs.scipy.org/doc/numpy/reference/arrays.datetime.html

@jreback
Contributor

jreback commented Feb 18, 2014

@cancan101 numpy's tz handling is not used by pandas at all; it is buggy in numpy < 1.8

@jorisvandenbossche
Member

I think the 'ideal' solution in the long term would be that all datetime-like columns/index (with and without timezones) would be converted to datetime64 (when timezone support in numpy has improved), but the question is what to do now:

  • convert to datetime64 and warn users tzinfo is lost and point to use of asobject to keep it
  • or automatically convert to object (as the PR of jreback) and let users use astype('datetime64[ns]') if they want that
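
The first option (convert and warn) can be sketched with the standard library alone. The helper name `to_naive_utc` and the warning text are hypothetical, and the fixed UTC-08:00 offset is an assumed stand-in for US/Pacific in January; this is not how pandas implements anything.

```python
import warnings
from datetime import datetime, timedelta, timezone


def to_naive_utc(values):
    """Hypothetical helper: normalize to naive UTC, warning if tz info is dropped."""
    out = []
    dropped = False
    for v in values:
        if v.tzinfo is not None:
            # Apply the offset, then discard the timezone label.
            v = v.astimezone(timezone.utc).replace(tzinfo=None)
            dropped = True
        out.append(v)
    if dropped:
        warnings.warn(
            "timezone information was dropped; values are now naive UTC",
            UserWarning,
            stacklevel=2,
        )
    return out


# Assumption: fixed offset used in place of a real US/Pacific tz database entry.
pst = timezone(timedelta(hours=-8), "PST")
print(to_naive_utc([datetime(2013, 1, 1, 13, 0, tzinfo=pst)]))
```

The second option corresponds to skipping the conversion entirely and keeping the aware objects as they are.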

@jreback
Contributor

jreback commented Feb 18, 2014

@jorisvandenbossche to your last point (this was already the case)

In [1]: i = pd.DatetimeIndex(pd.tseries.tools.to_datetime(['2013-1-1 13:00','2013-1-2 14:00'], errors="raise")).tz_localize('US/Pacific')

In [2]: df = DataFrame(np.random.randn(2,1),columns=['A'])

In [3]: i.to_pydatetime()
Out[3]: 
array([ datetime.datetime(2013, 1, 1, 13, 0, tzinfo=<DstTzInfo 'US/Pacific' PST-1 day, 16:00:00 STD>),
       datetime.datetime(2013, 1, 2, 14, 0, tzinfo=<DstTzInfo 'US/Pacific' PST-1 day, 16:00:00 STD>)], dtype=object)

In [4]: df['D'] = i.to_pydatetime()

In [5]: df['D']
Out[5]: 
0    2013-01-01 13:00:00-08:00
1    2013-01-02 14:00:00-08:00
Name: D, dtype: object

@jreback
Contributor

jreback commented Feb 18, 2014

@jorisvandenbossche I don't think this is really a problem, more an inconsistency: assigning a list of datetimes with tz's worked, but assigning a DatetimeIndex always forced conversion.

Now they both do the same thing. I think if a user has a tz, they should keep it (though I agree a warning could be in order). Thoughts on that?

@jreback
Contributor

jreback commented Feb 18, 2014

I am not sure there are a lot of tests for this kind of thing; there was very little breakage when I made this change. It seems reasonable (as I think it was slightly inconsistent before). The fundamental issue is that ATM a Series/column of a Frame must be object dtype if it has a time zone. In theory it could be datetime64[ns] with a timezone attached to the block, but that is pretty complicated and IMHO not worth the effort.

or maybe push the timezones down to numpy (if >= 1.8)
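
The "timezone attached to the block" idea can be sketched as follows: store every value as a UTC integer and keep one shared timezone for the whole column, converting only on access. `TzColumn` is a hypothetical toy container written for this illustration (it is not a pandas class), and the fixed UTC-08:00 offset is an assumed stand-in for US/Pacific in January.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
ONE_US = timedelta(microseconds=1)


class TzColumn:
    """Hypothetical container: UTC instants plus one shared timezone."""

    def __init__(self, aware_values, tz):
        self.tz = tz
        # Store each value as integer microseconds since the epoch, in UTC.
        self.utc_us = [(v - EPOCH) // ONE_US for v in aware_values]

    def __getitem__(self, i):
        # Rebuild the UTC instant, then present it in the column's timezone.
        return (EPOCH + self.utc_us[i] * ONE_US).astimezone(self.tz)


# Assumption: fixed offset used in place of a real US/Pacific tz database entry.
pst = timezone(timedelta(hours=-8), "PST")
col = TzColumn([datetime(2013, 1, 1, 13, 0, tzinfo=pst)], pst)
print(col[0].isoformat())  # 2013-01-01T13:00:00-08:00
```

The storage stays a homogeneous integer array, so only the column-level metadata has to carry the timezone.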

@jreback
Contributor

jreback commented Feb 18, 2014

All of the issues raised in the discussion linked below were fixed in pandas long ago (especially the DST transition issue; well, most of those issues):

http://numpy-discussion.10968.n7.nabble.com/timezones-and-datetime64-td33407.html

@cancan101
Contributor Author

At some point the tz can also be stored in numpy once users are on a new enough version (#6032 (comment) and http://docs.scipy.org/doc/numpy/reference/arrays.datetime.html).

@jorisvandenbossche
Member

@jreback Indeed, I agree this is consistent with datetime.datetime objects.

Still something:

With your PR merged, you get this (with the examples of above):

In [24]: pd.Series(i)
Out[24]:
0    2013-01-01 13:00:00-08:00
1    2013-01-02 14:00:00-08:00
dtype: object
In [25]: pd.Series(i).values
Out[25]: Index([2013-01-01 13:00:00-08:00, 2013-01-02 14:00:00-08:00], dtype='object')

Strange? You don't get an array but an Index object. When directly using an array of Timestamps with timezone, it works:

In [28]: pd.Series(np.array([pd.Timestamp('2013-01-01 13:00:00-0800', tz='US/Pacific'),
    pd.Timestamp('2013-01-01 14:00:00-0800', tz='US/Pacific')], dtype="object")).values
Out[28]:
array([Timestamp('2013-01-01 13:00:00-0800', tz='US/Pacific'),
       Timestamp('2013-01-01 14:00:00-0800', tz='US/Pacific')], dtype=object)

@jreback
Contributor

jreback commented Feb 20, 2014

hmm....that is wrong

@jreback
Contributor

jreback commented Feb 20, 2014

@jorisvandenbossche ok...fixed in #6419 thxs

@jorisvandenbossche
Member

@jreback super!

@jreback
Contributor

jreback commented Feb 20, 2014

@jorisvandenbossche fixed

3 participants