ENH/BUG: support Categorical in to_panel reshaping (GH8704) #8705

Merged
merged 1 commit into from
Nov 2, 2014

Conversation

jreback
Contributor

@jreback jreback commented Nov 2, 2014

CLN: move block2d_to_blocknd support code to core/internals.py
TST/BUG: support Categorical reshaping via .unstack

closes #8704
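
For illustration, the kind of reshaping this PR targets looks roughly like the sketch below. The data and names are invented, and it assumes a pandas version in which `.unstack` on categorical data is supported; `to_panel` is not shown.

```python
import pandas as pd

# Hypothetical data: a categorical column indexed by (row, col) pairs.
index = pd.MultiIndex.from_product(
    [["r1", "r2"], ["c1", "c2"]], names=["row", "col"]
)
values = pd.Categorical(["a", "b", "b", "a"], categories=["a", "b", "c"])
s = pd.Series(values, index=index)

# Pivot the inner index level ("col") into columns.
wide = s.unstack()
```

Here `wide` has one row per `row` label and one column per `col` label.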

@jreback jreback added Bug Categorical Categorical Data Type Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Nov 2, 2014
@jreback jreback added this to the 0.15.1 milestone Nov 2, 2014
@jreback
Contributor Author

jreback commented Nov 2, 2014

cc @immerrr

can you have a look? I think core/internals/BlockManager/iget needs to be slightly restructured. I am now allowing a CategoricalBlock in 3-d, BUT they are still represented as single blocks (e.g. NonConsolidatable).

So the problem is you can have multiple return blocks (e.g. _blknos is no longer unique).
I think instead of creating more blocks like I am doing in reshape_nd (for the Categorical case), we really should just support a 2-d version of Categorical in a single block (as the categories all have to be the same).

lmk what you think.

@jreback
Contributor Author

jreback commented Nov 2, 2014

cc @shoyer
cc @JanSchulz

you might also find this interesting, as you have brought this up before (e.g. a 2-d Categorical type)

@shoyer
Member

shoyer commented Nov 2, 2014

My hope was that we'll switch categoricals to dynd before too long, which would give us N-D categoricals for free. It looks like this works, although the fix is awkward. That said, getting the current categoricals fully N-dimensional really does not look that bad... I was most of the way there in #8012.

@immerrr
Contributor

immerrr commented Nov 2, 2014

> can you have a look? I think core/internals/BlockManager/iget needs to be slightly restructured. I am now allowing a CategoricalBlock in 3-d, BUT they are still represented as single blocks (e.g. NonConsolidatable).

I am not a fan of NonConsolidatable blocks in general. The idea may be great for getting 90% of use cases working with low effort, but it adds a lot of semantic load, e.g. you have to think through every tiny bit of API to work with it properly (which I guess is the issue you hit with this PR).

I suppose we could make categorical a proper dtype (or try and take one from dynd) to convey the category information. That would render the NonConsolidatable hack unnecessary, as categorical blocks would stop consolidating with object ones because of the dtype mismatch. As a bonus, categorical blocks that actually share the category info could be merged, and nd-categoricals would no longer be an issue.
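
A minimal sketch of that idea, in plain Python rather than pandas internals: if the dtype-like object carries the categories, grouping columns by dtype keeps categoricals apart from object columns and from categoricals with other categories. `CatDtypeSketch` and `group_by_dtype` are hypothetical names made up for this example, not pandas or dynd APIs.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Hashable, List, Tuple


@dataclass(frozen=True)
class CatDtypeSketch:
    """Hypothetical dtype-like object; not a pandas or dynd class."""

    categories: Tuple[str, ...]
    ordered: bool = False


def group_by_dtype(columns: Dict[str, Hashable]) -> Dict[Hashable, List[str]]:
    """Group column names by dtype; each group could form one block."""
    groups: Dict[Hashable, List[str]] = defaultdict(list)
    for name, dtype in columns.items():
        groups[dtype].append(name)
    return dict(groups)


columns = {
    "x": CatDtypeSketch(("a", "b")),
    "y": CatDtypeSketch(("a", "b")),       # same categories as "x"
    "z": CatDtypeSketch(("a", "b", "c")),  # different categories
    "o": "object",                         # stand-in for an object column
}
groups = group_by_dtype(columns)
```

Because the frozen dataclass compares and hashes by its fields, "x" and "y" land in one group while "z" and "o" each get their own.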

@immerrr
Contributor

immerrr commented Nov 2, 2014

An interesting generalization of categorical dtype would be a record-like categorical, where each item across one of the axes can be of its own categorical type, since it is basically a MultiIndex.

@jreback
Contributor Author

jreback commented Nov 2, 2014

  • dynd support will happen at some point (see the issue on adding integer NA), but this won't really solve anything anyhow; we still always have to represent them in pandas even if the underlying library has dtype support (though this wouldn't hurt)
  • NonConsolidatables are necessary for blocks of the same dtype that, for example, DON'T share the same categories; it's really a sub-dtype distinction.

So

a) could disallow this until we fix it (e.g. you can create a panel but selecting from it just doesn't work)
b) could add in a way to combine certain types of these blocks to form a 2-d version of categorical
(e.g. codes becomes 2-d, not sure how complicated this is)
c) add in a container type block that can hold multiple of these consolidatable blocks to avoid indexing issues

?
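
Option b) can be sketched roughly as follows, in plain Python rather than the block code itself. Each column is a (categories, codes) pair; when every column shares the same categories, the codes stack into one 2-d table. `combine_codes` is a hypothetical helper, and using -1 as the missing-value code is an assumption borrowed from how Categorical codes mark missing values.

```python
from typing import List, Sequence, Tuple


def combine_codes(
    columns: Sequence[Tuple[Sequence[str], Sequence[int]]],
) -> Tuple[Tuple[str, ...], List[List[int]]]:
    """Stack per-column codes into one 2-d table if all categories match.

    A code is the position of a value in `categories`; -1 marks missing.
    """
    if not columns:
        raise ValueError("need at least one column")
    shared = tuple(columns[0][0])
    for categories, _ in columns:
        if tuple(categories) != shared:
            raise ValueError("columns do not share the same categories")
    return shared, [list(codes) for _, codes in columns]


cats = ("a", "b", "c")
shared, codes_2d = combine_codes([(cats, [0, 1, -1]), (cats, [2, 2, 0])])

# Decode one cell: column 1, row 0.
value = shared[codes_2d[1][0]]
```

Columns whose categories differ are rejected here; a real implementation would have to decide whether to keep them as separate blocks instead.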

@immerrr
Contributor

immerrr commented Nov 2, 2014

> NonConsolidatables are necessary for blocks of the same dtype that, for example, DON'T share the same categories; it's really a sub-dtype distinction.

My point is exactly to put category information into dtype definition so that ones with different categories become unequal.

@jreback
Contributor Author

jreback commented Nov 2, 2014

@immerrr and I agree with you, but I don't think that's trivial / possible ATM (or maybe not ever, how does one define that these categories are different from others in a dtype-like? maybe hash it). The shape, sub-dtype is simply not good enough.

I think we have to work in the existing framework.

what do you think about the other ideas?

@immerrr
Contributor

immerrr commented Nov 2, 2014

> what do you think about the other ideas?

Of those I like b) most. Intentionally or not, to me it sounds pretty much like making categorical blocks "in some way consolidatable".

> how does one define that these categories are different from others in a dtype-like? maybe hash it

For the beginning, requiring a full match should be ok, e.g. (dtype1.cat.ordered == dtype2.cat.ordered) and np.array_equal(dtype1.cat.categories, dtype2.cat.categories). We can always figure out a smarter scheme afterwards.
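
A minimal sketch of that full-match rule, applied to `pd.Categorical` values rather than to a dtype object (the `dtype.cat` accessor written above is hypothetical; `same_categories` is a name made up for this example):

```python
import numpy as np
import pandas as pd


def same_categories(left: pd.Categorical, right: pd.Categorical) -> bool:
    """Full-match rule: same orderedness and identical categories, in order."""
    return bool(
        left.ordered == right.ordered
        and np.array_equal(left.categories, right.categories)
    )


a = pd.Categorical(["x", "y"], categories=["x", "y", "z"])
b = pd.Categorical(["z", "z"], categories=["x", "y", "z"])  # same categories
c = pd.Categorical(["x", "y"], categories=["x", "y"])       # fewer categories
d = pd.Categorical(["x", "y"], categories=["x", "y", "z"], ordered=True)
```

Under this rule only `a` and `b` would be candidates for sharing a block; `c` differs in categories and `d` in orderedness.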

CLN: move block2d_to_blocknd support code to core/internals.py
TST/BUG: support Categorical reshaping via .unstack
jreback added a commit that referenced this pull request Nov 2, 2014
ENH/BUG: support Categorical in to_panel reshaping (GH8704)
@jreback jreback merged commit 9d8b3a1 into pandas-dev:master Nov 2, 2014
Development

Successfully merging this pull request may close these issues.

BUG: reshape of categorical via unstack/to_panel
3 participants