-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Properly read Categorical msgpacks #14918
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Properly read Categorical msgpacks #14918
Conversation
Current coverage is 84.64% (diff: 100%)@@ master #14918 diff @@
==========================================
Files 144 144
Lines 51023 51021 -2
Methods 0 0
Messages 0 0
Branches 0 0
==========================================
- Hits 43190 43187 -3
- Misses 7833 7834 +1
Partials 0 0
|
index = obj[u'index'] | ||
result = globals()[obj[u'klass']](unconvert(obj[u'data'], dtype, | ||
obj[u'compress']), | ||
index=index, | ||
dtype=np_dtype, | ||
dtype=klass_dtype, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can simply pass pd_dtype and this will work (or if not, then we should fix that)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you test with datetimetz as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
datetimetz
is already tested in thisSeries
. -
No, we cannot just pass in
pd_dtype
because it will fail with tz-aware data-types:
>>> from pandas import Series, Timestamp, read_msgpack
>>> s = Series([Timestamp('20130102', tz='US/Eastern')] * 5)
>>> s2 = read_msgpack(s.to_msgpack())
>>>
>>> s.get_values()
array(['2013-01-02T05:00:00.000000000', '2013-01-02T05:00:00.000000000',
'2013-01-02T05:00:00.000000000', '2013-01-02T05:00:00.000000000',
'2013-01-02T05:00:00.000000000'], dtype='datetime64[ns]')
>>>
>>> s2.get_values()
array(['2013-01-02T10:00:00.000000000', '2013-01-02T10:00:00.000000000',
'2013-01-02T10:00:00.000000000', '2013-01-02T10:00:00.000000000',
'2013-01-02T10:00:00.000000000'], dtype='datetime64[ns]')
The reason is that the internal block representation of the Series
is altered, even though the external values are identical. This is not an issue with read_msgpack
per se but with how Series
construct datetimetz
arrays.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
well, then that is an issue of the construction itself. .get_values()
is not relevant here. Can you show an example (NOT using message pack), but just a direct construction which causes the issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback : It is absolutely relevant. The test for Series
equality is based on comparing the output of .get_values()
. Here is an easy way to reproduce with pure construction (similar to what happens internally in read_msgpack
):
>>> from pandas import Series, Timestamp
>>> s = Series([Timestamp('20130102', tz='US/Eastern')] * 5)
>>> s2 = Series(s.get_values(), dtype=s.dtype)
>>>
>>> s.get_values()
array(['2013-01-02T05:00:00.000000000', '2013-01-02T05:00:00.000000000',
'2013-01-02T05:00:00.000000000', '2013-01-02T05:00:00.000000000',
'2013-01-02T05:00:00.000000000'], dtype='datetime64[ns]')
>>>
>>> s2.get_values()
array(['2013-01-02T10:00:00.000000000', '2013-01-02T10:00:00.000000000',
'2013-01-02T10:00:00.000000000', '2013-01-02T10:00:00.000000000',
'2013-01-02T10:00:00.000000000'], dtype='datetime64[ns]')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand what you mean by that. How do you "switch" ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
actually I'll revise that. I am not sure where you get .get_values()
call from. How exactly is that happening?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What unconvert
returns is equivalent to s.get_values()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is the correct way to re-serialize.
In [29]: Series(s.values, dtype='M8[ns]').dt.tz_localize('UTC').dt.tz_convert(s.dt.tz)
Out[29]:
0 2013-01-02 00:00:00-05:00
1 2013-01-02 00:00:00-05:00
2 2013-01-02 00:00:00-05:00
3 2013-01-02 00:00:00-05:00
4 2013-01-02 00:00:00-05:00
dtype: datetime64[ns, US/Eastern]
The issue is you are storing UTC timezones. But you need to then re-convert them back to the correct timezone. This is a standard idiom (e.g. HDFStore does this).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. I changed where we put the tz
check to do something like that.
9bb2bee
to
76d4aab
Compare
ok looks reasonable. I haven't checked, but do we have sufficient tz-aware tests for series? if so then lgtm. |
@jreback : Yes, we do. That's why I realized we couldn't just pass in |
that's fine. as a followup (maybe). It might be cleaner to do all of the tz stuff in unconvert/convert (rather than doing it in each individual method).? |
Perhaps, though I'm not sure how necessary that is for now. |
@jreback : Everything is green. Ready to merge if there are no other concerns. |
if tz: | ||
result = klass(arr, index=index, dtype=pd_dtype.base, name=name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you need to do this duplication, create the result, the if a tz, convert, no? (IOW you can just pass pd_dtype
, pd_dtype.base
is not very useful
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, you must initialize as TZ-naive for this to work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this seems very hacky. can you find a more generic way to do this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hacky? How so? It's essentially doing what you suggested and using the same if-else
branch that was already there. Not really sure what you mean by "more generic" in this context.
ok, this is an actual bug in construction, where we construct incorrectly, then compensate in other places. removing this makes this just work: jreback@24f78cd here's the bug. I'll make a separate issue, then you can rebase on top of the fix. In master
with commit
see PR #14928 |
xref #14918 Author: Jeff Reback <[email protected]> Closes #14928 from jreback/timez_construction and squashes the following commits: 3dd8e99 [Jeff Reback] BUG: bug in Series construction from UTC
ok #14928 was merged. go ahead and rebase (you need to the code mostly back to what it was) |
Patches bug in read_msgpack in which Series categoricals were accidentally being constructed with a non-categorical dtype, resulting in an error. Closes pandas-devgh-14901.
76d4aab
to
b823ebe
Compare
@jreback : Change is nice and compact now and passing. Ready to merge if there are no other concerns. |
thanks! |
xref pandas-dev#14918 Author: Jeff Reback <[email protected]> Closes pandas-dev#14928 from jreback/timez_construction and squashes the following commits: 3dd8e99 [Jeff Reback] BUG: bug in Series construction from UTC
Patches bug in read_msgpack in which Series categoricals were accidentally being constructed with a non-categorical dtype, resulting in an error. Closes pandas-devgh-14901.
Patches bug in
read_msgpack
in whichSeries
categoricals were accidentally being constructed with a non-categorical dtype, resulting in an error.Closes #14901.