BUG: Properly read Categorical msgpacks #14918

Merged
merged 1 commit into pandas-dev:master from gfyoung:read-msgpack-categorical on Dec 21, 2016

Conversation

gfyoung
Member

@gfyoung gfyoung commented Dec 19, 2016

Patches a bug in read_msgpack in which Series categoricals were accidentally being constructed with a non-categorical dtype, resulting in an error.

Closes #14901.
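To illustrate the fix conceptually: a minimal, hypothetical sketch (not the PR's actual code) of what the msgpack decoder must effectively do. A categorical Series is stored as integer codes plus categories, and has to be rebuilt with a categorical dtype rather than the codes' integer dtype:

```python
import pandas as pd

# What the msgpack payload effectively stores for a categorical Series
codes = [0, 2, 1, 0]
categories = ["a", "b", "c"]

# Correct reconstruction: rebuild through Categorical, which yields
# dtype 'category' instead of the codes' integer dtype
cat = pd.Categorical.from_codes(codes, categories=categories)
s = pd.Series(cat)

assert str(s.dtype) == "category"
assert list(s) == ["a", "c", "b", "a"]
```

The variable names here are illustrative; the point is only that the rebuilt Series must carry a categorical dtype.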

@jreback jreback added Categorical and Msgpack labels Dec 19, 2016
@codecov-io

codecov-io commented Dec 19, 2016

Current coverage is 84.64% (diff: 100%)

Merging #14918 into master will decrease coverage by <.01%

@@             master     #14918   diff @@
==========================================
  Files           144        144          
  Lines         51023      51021     -2   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
- Hits          43190      43187     -3   
- Misses         7833       7834     +1   
  Partials          0          0          

Powered by Codecov. Last update 1678f14...b823ebe

  index = obj[u'index']
  result = globals()[obj[u'klass']](unconvert(obj[u'data'], dtype,
                                              obj[u'compress']),
                                    index=index,
-                                   dtype=np_dtype,
+                                   dtype=klass_dtype,
Contributor

I think you can simply pass pd_dtype and this will work (or if not, then we should fix that)

Contributor

can you test with datetimetz as well

Member Author

@gfyoung gfyoung Dec 19, 2016

  1. datetimetz is already tested in this Series.

  2. No, we cannot just pass in pd_dtype because it will fail with tz-aware dtypes:

>>> from pandas import Series, Timestamp, read_msgpack
>>> s = Series([Timestamp('20130102', tz='US/Eastern')] * 5)
>>> s2 = read_msgpack(s.to_msgpack())
>>>
>>> s.get_values()
array(['2013-01-02T05:00:00.000000000', '2013-01-02T05:00:00.000000000',
       '2013-01-02T05:00:00.000000000', '2013-01-02T05:00:00.000000000',
       '2013-01-02T05:00:00.000000000'], dtype='datetime64[ns]')
>>>
>>> s2.get_values()
array(['2013-01-02T10:00:00.000000000', '2013-01-02T10:00:00.000000000',
       '2013-01-02T10:00:00.000000000', '2013-01-02T10:00:00.000000000',
       '2013-01-02T10:00:00.000000000'], dtype='datetime64[ns]')

The reason is that the internal block representation of the Series is altered, even though the external values are identical. This is not an issue with read_msgpack per se but with how Series constructs datetimetz arrays.

Contributor

well, then that is an issue with the construction itself. .get_values() is not relevant here. Can you show an example (NOT using msgpack), just a direct construction that causes the issue?

Member Author

@gfyoung gfyoung Dec 19, 2016

@jreback : It is absolutely relevant. The test for Series equality is based on comparing the output of .get_values(). Here is an easy way to reproduce with pure construction (similar to what happens internally in read_msgpack):

>>> from pandas import Series, Timestamp
>>> s = Series([Timestamp('20130102', tz='US/Eastern')] * 5)
>>> s2 = Series(s.get_values(), dtype=s.dtype)
>>>
>>> s.get_values()
array(['2013-01-02T05:00:00.000000000', '2013-01-02T05:00:00.000000000',
       '2013-01-02T05:00:00.000000000', '2013-01-02T05:00:00.000000000',
       '2013-01-02T05:00:00.000000000'], dtype='datetime64[ns]')
>>>
>>> s2.get_values()
array(['2013-01-02T10:00:00.000000000', '2013-01-02T10:00:00.000000000',
       '2013-01-02T10:00:00.000000000', '2013-01-02T10:00:00.000000000',
       '2013-01-02T10:00:00.000000000'], dtype='datetime64[ns]')

Member Author

@gfyoung gfyoung Dec 19, 2016

I don't understand what you mean by that. How do you "switch"?

Contributor

actually I'll revise that. I am not sure where you get the .get_values() call from. How exactly is that happening?

Member Author

@gfyoung gfyoung Dec 19, 2016

What unconvert returns is equivalent to s.get_values().

Contributor

Here is the correct way to re-serialize.

In [29]: Series(s.values, dtype='M8[ns]').dt.tz_localize('UTC').dt.tz_convert(s.dt.tz)
Out[29]:
0   2013-01-02 00:00:00-05:00
1   2013-01-02 00:00:00-05:00
2   2013-01-02 00:00:00-05:00
3   2013-01-02 00:00:00-05:00
4   2013-01-02 00:00:00-05:00
dtype: datetime64[ns, US/Eastern]

The issue is that you are storing the values as UTC; you then need to convert them back to the original timezone. This is a standard idiom (e.g. HDFStore does this).
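The store-as-UTC idiom can be sketched end to end (a minimal illustration under the assumptions above, not the PR's code; the variable names are made up):

```python
import pandas as pd

# Original tz-aware Series
s = pd.Series(pd.date_range("2013-01-02", periods=3, tz="US/Eastern"))

# Serialization effectively stores the underlying instants as naive UTC values
naive_utc = s.dt.tz_convert("UTC").dt.tz_localize(None)

# Deserialization: localize to UTC first, then convert to the stored timezone,
# which recovers both the instants and the original wall times
restored = naive_utc.dt.tz_localize("UTC").dt.tz_convert("US/Eastern")

assert restored.equals(s)
```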

Member Author

I see. I changed where we put the tz check to do something like that.

@gfyoung gfyoung force-pushed the read-msgpack-categorical branch from 9bb2bee to 76d4aab on December 19, 2016
@jreback
Contributor

jreback commented Dec 19, 2016

ok looks reasonable. I haven't checked, but do we have sufficient tz-aware tests for series? if so then lgtm.

@jreback jreback added this to the 0.20.0 milestone Dec 19, 2016
@gfyoung
Member Author

gfyoung commented Dec 19, 2016

@jreback : Yes, we do. That's why I realized we couldn't just pass in pd_dtype all the time.

@jreback
Contributor

jreback commented Dec 19, 2016

that's fine. as a followup (maybe), it might be cleaner to do all of the tz stuff in unconvert/convert (rather than doing it in each individual method)?

@gfyoung
Member Author

gfyoung commented Dec 19, 2016

Perhaps, though I'm not sure how necessary that is for now.

@gfyoung
Member Author

gfyoung commented Dec 20, 2016

@jreback : Everything is green. Ready to merge if there are no other concerns.

if tz:
    result = klass(arr, index=index, dtype=pd_dtype.base, name=name)
Contributor

I don't think you need this duplication: create the result, then if there is a tz, convert, no? (IOW you can just pass pd_dtype; pd_dtype.base is not very useful.)

Member Author

No, you must initialize as TZ-naive for this to work.

Contributor

this seems very hacky. can you find a more generic way to do this.

Member Author

Hacky? How so? It's essentially doing what you suggested and using the same if-else branch that was already there. Not really sure what you mean by "more generic" in this context.

@jreback
Contributor

jreback commented Dec 20, 2016

ok, this is an actual bug in construction, where we construct incorrectly, then compensate in other places. removing this compensation makes it just work: jreback@24f78cd

here's the bug. I'll make a separate issue, then you can rebase on top of the fix.

In master

In [1]: s = Series(pd.date_range('20130101',periods=3,tz='US/Eastern'))

In [2]: s
Out[2]: 
0   2013-01-01 00:00:00-05:00
1   2013-01-02 00:00:00-05:00
2   2013-01-03 00:00:00-05:00
dtype: datetime64[ns, US/Eastern]

In [3]: Series(s,dtype=s.dtype)
Out[3]: 
0   2013-01-01 00:00:00-05:00
1   2013-01-02 00:00:00-05:00
2   2013-01-03 00:00:00-05:00
dtype: datetime64[ns, US/Eastern]

In [5]: Series(s.dt.tz_convert('UTC'),dtype=s.dtype)
Out[5]: 
0   2013-01-01 00:00:00-05:00
1   2013-01-02 00:00:00-05:00
2   2013-01-03 00:00:00-05:00
dtype: datetime64[ns, US/Eastern]
In [6]: Series(s.values,dtype=s.dtype)
Out[6]: 
0   2013-01-01 05:00:00-05:00
1   2013-01-02 05:00:00-05:00
2   2013-01-03 05:00:00-05:00
dtype: datetime64[ns, US/Eastern]

with commit

In [1]: s = Series(pd.date_range('20130101',periods=3,tz='US/Eastern'))

In [2]: Series(s,dtype=s.dtype)
Out[2]: 
0   2013-01-01 00:00:00-05:00
1   2013-01-02 00:00:00-05:00
2   2013-01-03 00:00:00-05:00
dtype: datetime64[ns, US/Eastern]

In [3]: Series(s.dt.tz_convert('UTC'),dtype=s.dtype)
Out[3]: 
0   2013-01-01 00:00:00-05:00
1   2013-01-02 00:00:00-05:00
2   2013-01-03 00:00:00-05:00
dtype: datetime64[ns, US/Eastern]

In [4]: Series(s.values,dtype=s.dtype)
Out[4]: 
0   2013-01-01 00:00:00-05:00
1   2013-01-02 00:00:00-05:00
2   2013-01-03 00:00:00-05:00
dtype: datetime64[ns, US/Eastern]

see PR #14928
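A stable check of the invariant the examples above demonstrate (a sketch only: the `Series(s.values, dtype=s.dtype)` case is omitted here because its behavior around naive values changed in later pandas versions):

```python
import pandas as pd

s = pd.Series(pd.date_range("20130101", periods=3, tz="US/Eastern"))

# Re-constructing from a tz-aware Series, or from a UTC-converted view of it,
# must preserve the wall times, matching the sessions shown above
s2 = pd.Series(s, dtype=s.dtype)
s3 = pd.Series(s.dt.tz_convert("UTC"), dtype=s.dtype)

assert s2.equals(s)
assert s3.equals(s)
```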

jreback added a commit that referenced this pull request Dec 20, 2016
xref #14918

Author: Jeff Reback <[email protected]>

Closes #14928 from jreback/timez_construction and squashes the following commits:

3dd8e99 [Jeff Reback] BUG: bug in Series construction from UTC
@jreback
Contributor

jreback commented Dec 20, 2016

ok #14928 was merged. go ahead and rebase (you can change the code mostly back to what it was)

@gfyoung gfyoung force-pushed the read-msgpack-categorical branch from 76d4aab to b823ebe on December 21, 2016
@gfyoung
Member Author

gfyoung commented Dec 21, 2016

@jreback : Change is nice and compact now and passing. Ready to merge if there are no other concerns.

@jreback jreback merged commit 73e2829 into pandas-dev:master Dec 21, 2016
@jreback
Contributor

jreback commented Dec 21, 2016

thanks!

@gfyoung gfyoung deleted the read-msgpack-categorical branch December 21, 2016 15:46
ShaharBental pushed two commits to ShaharBental/pandas that referenced this pull request Dec 26, 2016