-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: allow categoricals in msgpack #12573
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
ctor_dtype = np_dtype | ||
if is_categorical_dtype(pd_dtype): | ||
# Series ctor doesn't take dtype with categorical | ||
ctor_dtype = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback - should the Series
constructor accept a Categorical
with dtype=category
? This part felt hacky.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it already does.
However in this case want to construct a pd.Categorical
as you can pass all of the attributes thru
In [1]: Series(list('aabc'),dtype='category')
Out[1]:
0 a
1 a
2 b
3 c
dtype: category
Categories (3, object): [a, b, c]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but you don't need this here at all. the dtype
is not present so its None
, the Series
will just take a Categorical
directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can I conclude that no change is required here in scope of this PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, this needs to be reworked a bit, seem my 2nd comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, that might be a bug as in this case it is ok to specify the dtype (but it should be ignored as this is pretty generic as it doesn't specify anything in particular about the dtype, so its compatible). can you open a separate issue for that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to start a new thread on this line of code.
So, if I understood, it shouldn't be necessary to special case the dtype
? If I remove the dtype=ctor_dtype
change (and leave it set to dtype=np_dtype
), then it does not work. I get:
ValueError: cannot specify a dtype with a Categorical unless dtype='category'
I don't currently understand what is being asked for here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you need to rebase on master, that's the issue that was just merged
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was after rebasing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so that's what I was first going to write. This is not a numpy-dtype and cannot be treated like one. Instead put this code in encode/decode
(e.g. check isinstance(obj, Categorical)
). Then you won't have this issue at all.
Done. Please let me know if it is to taste. |
@@ -23,6 +23,9 @@ Enhancements | |||
~~~~~~~~~~~~ | |||
|
|||
|
|||
- ``read_msgpack`` now supports categoricals (:issue:`12573`) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
support for serializing and de-serializing categoricalls with msgpack
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
aa487d1
to
6a3cdce
Compare
@@ -341,6 +342,7 @@ def setUp(self): | |||
self.d['mixed'] = Series(data['E']) | |||
self.d['dt_tz_mixed'] = Series(data['F']) | |||
self.d['dt_tz'] = Series(data['G']) | |||
self.d['categorical'] = Series(data['H']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chris-b1 is this correct? Why is it Series
and not Categorical
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It wouldn't make a difference here, e.g. df['c'] = Categorical(..)
and df['c'] = Series(Categorical(..))
do the same thing
ok #12574 has been merged. pls rebase/update. |
6a3cdce
to
59cf8fd
Compare
Rebased. |
@@ -30,7 +30,7 @@ Enhancements | |||
~~~~~~~~~~~~ | |||
|
|||
|
|||
|
|||
- support for serializing and de-serializing categoricalls with msgpack (:issue:`12573`) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Support.........categoricals (sp)
d5d077b
to
ecab0d4
Compare
Wow, I think I managed to clear all of your comments. Everything suddenly made sense in the end. Please let me know if I missed something and if I should squash or reword my commits. |
I broke the NDFrame tests. Investigating. |
return {u'typ': u'categorical', | ||
u'klass': u(obj.__class__.__name__), | ||
u'name': getattr(obj, 'name', None), | ||
u'codes_dtype': u(obj._codes.dtype.name), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you actually need to specify the codes/catgories
dtypes. Those will be automatic. IOW, don't use convert
here, just same codes : obj.codes
and categories : obj.categories
. These are recursive encoder/decoders.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
84c4b6c
to
a2a0693
Compare
return {u'typ': u'categorical', | ||
u'klass': u(obj.__class__.__name__), | ||
u'name': getattr(obj, 'name', None), | ||
u'codes': obj._codes, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
prob use obj.codes
here (same but _codes
is internal)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
a2a0693
to
ae97751
Compare
'G': [Timestamp('20130603', tz='CET')] * 5 | ||
'G': [Timestamp('20130603', tz='CET')] * 5, | ||
'H': Categorical(['a', 'b', 'c', 'd', 'e']), | ||
'I': Categorical(['a', 'b', 'c', 'd', 'e'], ordered=True), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm in a strange position here. I added this test - it passes, but I didn't add the relevant code in convert/unconvert to pass the ordered parameter through. What gives?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that check is not used at all! I was just going to tell you to take it out. a Categorical
is fully serialized/deserialized via encode/decode
the dtype
is NEVER category
. except when its a series
but that is already handled.
5205852
to
330d327
Compare
@pwaller can you rebase/update |
116e3d8
to
d1792d9
Compare
this looks pretty good code wise. can you squash everything. |
cc @kawochen |
fc7e1f1
to
52ea6e4
Compare
Squashed. |
@@ -262,6 +265,9 @@ def convert(values): | |||
v = values.ravel() | |||
|
|||
# convert object | |||
if is_categorical_dtype(values): | |||
return v | |||
|
|||
if dtype == np.object_: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change line 271 to:
elif is_object_dtype(dtype)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
52ea6e4
to
e7c4d9d
Compare
Rebased and comments addressed. |
I'm getting a test failure I'm not sure how to address and I don't have much time to look into it:
|
@@ -393,6 +402,16 @@ def encode(obj): | |||
u'dtype': u(obj.dtype.name), | |||
u'data': convert(obj.values), | |||
u'compress': compressor} | |||
|
|||
elif isinstance(obj, Categorical): | |||
return {u'typ': u'categorical', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this needs to be 'category'
e7c4d9d
to
c257021
Compare
Updated again. Thanks for your patience. |
To fix that error
|
DOC: support for categoricals in read_msgpack Add TestCategorical test cases Add Catecorical ordered=True ndframe test
Forced again. |
c257021
to
ff87282
Compare
ok looks good. ping when green. |
thanks! |
Supercedes #12191.
I've made a best-effort to rebase this against master.
It seems to work here.