BUG: Properly read Categorical msgpacks #14918

Merged
merged 1 commit into pandas-dev:master from gfyoung:read-msgpack-categorical on Dec 21, 2016

Conversation

gfyoung
Member

@gfyoung gfyoung commented Dec 19, 2016

Patches a bug in read_msgpack in which Series categoricals were accidentally being constructed with a non-categorical dtype, resulting in an error.

Closes #14901.
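To illustrate the fix conceptually: a minimal, hypothetical sketch (not the PR's actual code) of what the msgpack decoder must effectively do. A categorical Series is stored as integer codes plus categories, and has to be rebuilt with a categorical dtype rather than the codes' integer dtype:

```python
import pandas as pd

# What the msgpack payload effectively stores for a categorical Series
codes = [0, 2, 1, 0]
categories = ["a", "b", "c"]

# Correct reconstruction: rebuild through Categorical, which yields
# dtype 'category' instead of the codes' integer dtype
cat = pd.Categorical.from_codes(codes, categories=categories)
s = pd.Series(cat)

assert str(s.dtype) == "category"
assert list(s) == ["a", "c", "b", "a"]
```

The variable names here are illustrative; the point is only that the rebuilt Series must carry a categorical dtype.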

@jreback jreback added Categorical and Msgpack labels Dec 19, 2016
@codecov-io

codecov-io commented Dec 19, 2016

Current coverage is 84.64% (diff: 100%)

Merging #14918 into master will decrease coverage by <.01%

@@             master     #14918   diff @@
==========================================
  Files           144        144          
  Lines         51023      51021     -2   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
- Hits          43190      43187     -3   
- Misses         7833       7834     +1   
  Partials          0          0          

Powered by Codecov. Last update 1678f14...b823ebe

  index = obj[u'index']
  result = globals()[obj[u'klass']](unconvert(obj[u'data'], dtype,
                                              obj[u'compress']),
                                    index=index,
-                                   dtype=np_dtype,
+                                   dtype=klass_dtype,
Contributor

I think you can simply pass pd_dtype and this will work (or if not, then we should fix that)

Contributor

can you test with datetimetz as well

Member Author

@gfyoung gfyoung Dec 19, 2016

  1. datetimetz is already tested in this Series.

  2. No, we cannot just pass in pd_dtype because it will fail with tz-aware dtypes:

>>> from pandas import Series, Timestamp, read_msgpack
>>> s = Series([Timestamp('20130102', tz='US/Eastern')] * 5)
>>> s2 = read_msgpack(s.to_msgpack())
>>>
>>> s.get_values()
array(['2013-01-02T05:00:00.000000000', '2013-01-02T05:00:00.000000000',
       '2013-01-02T05:00:00.000000000', '2013-01-02T05:00:00.000000000',
       '2013-01-02T05:00:00.000000000'], dtype='datetime64[ns]')
>>>
>>> s2.get_values()
array(['2013-01-02T10:00:00.000000000', '2013-01-02T10:00:00.000000000',
       '2013-01-02T10:00:00.000000000', '2013-01-02T10:00:00.000000000',
       '2013-01-02T10:00:00.000000000'], dtype='datetime64[ns]')

The reason is that the internal block representation of the Series is altered, even though the external values are identical. This is not an issue with read_msgpack per se but with how Series constructs datetimetz arrays.

Contributor

well, then that is an issue with the construction itself. .get_values() is not relevant here. Can you show an example (NOT using msgpack), just a direct construction that causes the issue?

Member Author

@gfyoung gfyoung Dec 19, 2016

@jreback : It is absolutely relevant. The test for Series equality is based on comparing the output of .get_values(). Here is an easy way to reproduce with pure construction (similar to what happens internally in read_msgpack):

>>> from pandas import Series, Timestamp
>>> s = Series([Timestamp('20130102', tz='US/Eastern')] * 5)
>>> s2 = Series(s.get_values(), dtype=s.dtype)
>>>
>>> s.get_values()
array(['2013-01-02T05:00:00.000000000', '2013-01-02T05:00:00.000000000',
       '2013-01-02T05:00:00.000000000', '2013-01-02T05:00:00.000000000',
       '2013-01-02T05:00:00.000000000'], dtype='datetime64[ns]')
>>>
>>> s2.get_values()
array(['2013-01-02T10:00:00.000000000', '2013-01-02T10:00:00.000000000',
       '2013-01-02T10:00:00.000000000', '2013-01-02T10:00:00.000000000',
       '2013-01-02T10:00:00.000000000'], dtype='datetime64[ns]')

Member Author

@gfyoung gfyoung Dec 19, 2016

I don't understand what you mean by that. How do you "switch"?

Contributor

actually I'll revise that. I am not sure where you get the .get_values() call from. How exactly is that happening?

Member Author

@gfyoung gfyoung Dec 19, 2016

What unconvert returns is equivalent to s.get_values().

Contributor

Here is the correct way to re-serialize.

In [29]: Series(s.values, dtype='M8[ns]').dt.tz_localize('UTC').dt.tz_convert(s.dt.tz)
Out[29]:
0   2013-01-02 00:00:00-05:00
1   2013-01-02 00:00:00-05:00
2   2013-01-02 00:00:00-05:00
3   2013-01-02 00:00:00-05:00
4   2013-01-02 00:00:00-05:00
dtype: datetime64[ns, US/Eastern]

The issue is that you are storing the values as UTC; you then need to convert them back to the original timezone. This is a standard idiom (e.g. HDFStore does this).
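The store-as-UTC idiom can be sketched end to end (a minimal illustration under the assumptions above, not the PR's code; the variable names are made up):

```python
import pandas as pd

# Original tz-aware Series
s = pd.Series(pd.date_range("2013-01-02", periods=3, tz="US/Eastern"))

# Serialization effectively stores the underlying instants as naive UTC values
naive_utc = s.dt.tz_convert("UTC").dt.tz_localize(None)

# Deserialization: localize to UTC first, then convert to the stored timezone,
# which recovers both the instants and the original wall times
restored = naive_utc.dt.tz_localize("UTC").dt.tz_convert("US/Eastern")

assert restored.equals(s)
```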

Member Author

I see. I changed where we put the tz check to do something like that.

@gfyoung gfyoung force-pushed the read-msgpack-categorical branch from 9bb2bee to 76d4aab on December 19, 2016
@jreback
Contributor

jreback commented Dec 19, 2016

ok looks reasonable. I haven't checked, but do we have sufficient tz-aware tests for series? if so then lgtm.

@jreback jreback added this to the 0.20.0 milestone Dec 19, 2016
@gfyoung
Member Author

gfyoung commented Dec 19, 2016

@jreback : Yes, we do. That's why I realized we couldn't just pass in pd_dtype all the time.

@jreback
Contributor

jreback commented Dec 19, 2016

that's fine. as a followup (maybe), it might be cleaner to do all of the tz stuff in unconvert/convert (rather than doing it in each individual method)?

@gfyoung
Member Author

gfyoung commented Dec 19, 2016

Perhaps, though I'm not sure how necessary that is for now.

@gfyoung
Member Author

gfyoung commented Dec 20, 2016

@jreback : Everything is green. Ready to merge if there are no other concerns.

if tz:
    result = klass(arr, index=index, dtype=pd_dtype.base, name=name)
Contributor

I don't think you need this duplication: create the result, then if there is a tz, convert, no? (IOW you can just pass pd_dtype; pd_dtype.base is not very useful.)

Member Author

No, you must initialize as TZ-naive for this to work.

Contributor

this seems very hacky. can you find a more generic way to do this.

Member Author

Hacky? How so? It's essentially doing what you suggested and using the same if-else branch that was already there. Not really sure what you mean by "more generic" in this context.

@jreback
Contributor

jreback commented Dec 20, 2016

ok, this is an actual bug in construction, where we construct incorrectly, then compensate in other places. removing this compensation makes it just work: jreback@24f78cd

here's the bug. I'll make a separate issue, then you can rebase on top of the fix.

In master

In [1]: s = Series(pd.date_range('20130101',periods=3,tz='US/Eastern'))

In [2]: s
Out[2]: 
0   2013-01-01 00:00:00-05:00
1   2013-01-02 00:00:00-05:00
2   2013-01-03 00:00:00-05:00
dtype: datetime64[ns, US/Eastern]

In [3]: Series(s,dtype=s.dtype)
Out[3]: 
0   2013-01-01 00:00:00-05:00
1   2013-01-02 00:00:00-05:00
2   2013-01-03 00:00:00-05:00
dtype: datetime64[ns, US/Eastern]

In [5]: Series(s.dt.tz_convert('UTC'),dtype=s.dtype)
Out[5]: 
0   2013-01-01 00:00:00-05:00
1   2013-01-02 00:00:00-05:00
2   2013-01-03 00:00:00-05:00
dtype: datetime64[ns, US/Eastern]
In [6]: Series(s.values,dtype=s.dtype)
Out[6]: 
0   2013-01-01 05:00:00-05:00
1   2013-01-02 05:00:00-05:00
2   2013-01-03 05:00:00-05:00
dtype: datetime64[ns, US/Eastern]

with commit

In [1]: s = Series(pd.date_range('20130101',periods=3,tz='US/Eastern'))

In [2]: Series(s,dtype=s.dtype)
Out[2]: 
0   2013-01-01 00:00:00-05:00
1   2013-01-02 00:00:00-05:00
2   2013-01-03 00:00:00-05:00
dtype: datetime64[ns, US/Eastern]

In [3]: Series(s.dt.tz_convert('UTC'),dtype=s.dtype)
Out[3]: 
0   2013-01-01 00:00:00-05:00
1   2013-01-02 00:00:00-05:00
2   2013-01-03 00:00:00-05:00
dtype: datetime64[ns, US/Eastern]

In [4]: Series(s.values,dtype=s.dtype)
Out[4]: 
0   2013-01-01 00:00:00-05:00
1   2013-01-02 00:00:00-05:00
2   2013-01-03 00:00:00-05:00
dtype: datetime64[ns, US/Eastern]

see PR #14928
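A stable check of the invariant the examples above demonstrate (a sketch only: the `Series(s.values, dtype=s.dtype)` case is omitted here because its behavior around naive values changed in later pandas versions):

```python
import pandas as pd

s = pd.Series(pd.date_range("20130101", periods=3, tz="US/Eastern"))

# Re-constructing from a tz-aware Series, or from a UTC-converted view of it,
# must preserve the wall times, matching the sessions shown above
s2 = pd.Series(s, dtype=s.dtype)
s3 = pd.Series(s.dt.tz_convert("UTC"), dtype=s.dtype)

assert s2.equals(s)
assert s3.equals(s)
```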

jreback added a commit that referenced this pull request Dec 20, 2016
xref #14918

Author: Jeff Reback <[email protected]>

Closes #14928 from jreback/timez_construction and squashes the following commits:

3dd8e99 [Jeff Reback] BUG: bug in Series construction from UTC
@jreback
Contributor

jreback commented Dec 20, 2016

ok #14928 was merged. go ahead and rebase (you can change the code mostly back to what it was)

@gfyoung gfyoung force-pushed the read-msgpack-categorical branch from 76d4aab to b823ebe on December 21, 2016
@gfyoung
Member Author

gfyoung commented Dec 21, 2016

@jreback : Change is nice and compact now and passing. Ready to merge if there are no other concerns.

@jreback jreback merged commit 73e2829 into pandas-dev:master Dec 21, 2016
@jreback
Contributor

jreback commented Dec 21, 2016

thanks!

@gfyoung gfyoung deleted the read-msgpack-categorical branch December 21, 2016 15:46
ShaharBental pushed two commits to ShaharBental/pandas that referenced this pull request Dec 26, 2016