Ordered vs. Unordered Categoricals #9148
Comments
these are directly supported via see a tutorial here: http://nbviewer.ipython.org/github/jreback/pydata2014-pandas/blob/master/notebooks/Categorical.ipynb |
Ah, nice. Thanks! |
What's the preferred idiom for checking for these new first-class citizen Categoricals? Checking for the |
See also the fuller discussion in #8814, as this needs a section in the docs.
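For the question above about detecting categoricals, one possible check is sketched below. The helper name `describe_categorical` is hypothetical, and the snippet assumes a pandas version that exposes `pd.CategoricalDtype` at the top level; it is not necessarily the idiom the maintainers had in mind at the time of this thread.

```python
import pandas as pd


def describe_categorical(series):
    """Return (is_categorical, is_ordered) for a Series (hypothetical helper)."""
    dtype = series.dtype
    if isinstance(dtype, pd.CategoricalDtype):
        # The ordered flag lives on the dtype itself.
        return True, bool(dtype.ordered)
    return False, False


plain = pd.Series(["a", "b", "a"])
unordered = plain.astype("category")
ordered = pd.Series(
    pd.Categorical(["a", "b", "a"], categories=["b", "a"], ordered=True)
)
```

Calling `describe_categorical` on each of the three Series distinguishes a plain object column, an unordered categorical, and an ordered categorical.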
|
Not sure if this is the right place for this, but the behavior of unordered categorical surprised me a bit:
import string
import numpy as np
import pandas as pd
cat = pd.Categorical(list(string.uppercase), ordered=False)
num = np.random.randn(26)
df = pd.DataFrame(dict(cat=cat, num=num))
df.sort("num").cat.unique()
This returns
I was expecting to get the order of letters that now appears in the DataFrame (which would happen if This is relevant because I want to leverage Categorical objects in seaborn to get plot elements ordered in good ways. I was hoping that Is this the correct behavior? (Probably). Is there a way to get what I want out of the object? |
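A minimal sketch of one way to get the letters in the order they appear after sorting, written against more recent pandas (where `DataFrame.sort` became `sort_values`) and Python 3 (`string.ascii_uppercase`). The helper `appearance_order` is hypothetical and deliberately avoids `unique()`, so it does not depend on how any particular pandas version orders the result for categoricals.

```python
import string

import numpy as np
import pandas as pd


def appearance_order(series):
    """Unique values in order of first appearance (hypothetical helper)."""
    # dict preserves insertion order, so this keeps the first occurrence of each value.
    return list(dict.fromkeys(series.tolist()))


rng = np.random.default_rng(0)
cat = pd.Categorical(list(string.ascii_uppercase), ordered=False)
df = pd.DataFrame({"cat": cat, "num": rng.standard_normal(26)})

# Sort the frame by the numeric column, then read off the letters as they now appear.
sorted_df = df.sort_values("num")
letters_by_num = appearance_order(sorted_df["cat"])
```

Because the order is computed from the values rather than from the categories, the same helper works whether the column is categorical or not.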
|
That gives me the array of values, but I want the array of unique values, and if I use |
Also I'd have to handle the "is this column a categorical dtype series, and if so, is it already ordered?" logic on my end, where it would be preferable to get a consistent answer from the series itself. |
|
When I use an ordered category object, it is:
cat = pd.Categorical(["b", "c", "c", "a", "b", "c"], categories=["c", "b", "a"])
cat.unique()
gives
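The distinction at issue here can be sketched as follows: the declared category order and the order in which values first appear are two different things. The example passes `ordered=True` explicitly, since whether supplying `categories` alone yields an ordered categorical is an assumption that should be checked against the pandas version in use.

```python
import pandas as pd

cat = pd.Categorical(
    ["b", "c", "c", "a", "b", "c"],
    categories=["c", "b", "a"],
    ordered=True,  # explicit, rather than relying on a version-dependent default
)

# Order given when the categorical was constructed.
declared = list(cat.categories)

# Unique values, in the order they first appear in the data.
seen = list(cat.unique())

# Unique values rearranged to follow the category order.
seen_sorted = list(cat.unique().sort_values())
```

Here `declared` and `seen_sorted` agree with each other, while `seen` follows the data.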
|
|
From the docstring:
This caused me to think that setting |
Maybe df.groupby('cat').num.first().order().index |
Actually, @TomAugspurger's solution is probably the best. You are uniquifying on one column, but ordering by another. |
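As a hedged sketch, the groupby suggestion above might look like this in more recent pandas, where `Series.order()` has been replaced by `sort_values()`. The small frame is made up for illustration, and `observed=True` is passed explicitly so that only categories present in the data are returned.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "cat": pd.Categorical(["a", "b", "a", "c"]),
        "num": [3.0, 1.0, 0.0, 2.0],
    }
)

# Take the first "num" seen for each category, then order the categories by that value.
first_per_cat = df.groupby("cat", observed=True)["num"].first()
order = list(first_per_cat.sort_values().index)
```

Note that `first()` depends on the row order of the frame at the time of the call, so sorting the frame beforehand changes which value represents each category.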
Do we specify the order of values from |
No, unique is by definition unordered; the ordering is undefined. |
An ordered Categorical just means that the sort order is defined differently, namely by the order given when the categorical is constructed. This is the purpose of |
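To illustrate the point about sort order, here is a small sketch with made-up data: sorting, `min`/`max`, and comparisons on an ordered categorical all follow the order in which the categories were declared, not lexical order.

```python
import pandas as pd

s = pd.Series(
    pd.Categorical(
        ["b", "c", "c", "a", "b", "c"],
        categories=["c", "b", "a"],
        ordered=True,
    )
)

# Sorting follows the declared category order: c, then b, then a.
sorted_values = s.sort_values().tolist()

# min/max and comparisons are defined because the categorical is ordered.
smallest, largest = s.min(), s.max()
is_before_a = (s < "a").tolist()
```

With lexical ordering `"a"` would be the smallest value; here it is the largest because it was declared last.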
OK so None of the proposed approaches are at all helpful for my usecase, which I don't think is that crazy. |
Agreed that your usecase is not crazy. My solution is a bit hacky. Looking at the code, we sort by category codes with |
Here's a more concrete example of what I'm talking about: http://nbviewer.ipython.org/gist/mwaskom/7b464a7217858cbdcedd The basic idea is that the sorting of the DataFrame is external to seaborn, but the determination of the order in which we should plot the grouping variable is internal. I want to be able to say "use Categorical wherever possible" because it gets you 1) automatic detection of what orientation the plot should be drawn in and 2) the order you expect when the categorical is ordered. But that won't work here, when we want the order of categories to change as the dataframe gets sorted by different numeric variables. This leads to confusing information about when to use |
Almost certainly, we should be using @mwaskom |
I'm going to reopen this issue until we clarify what the solution here should look like. My earlier comment was somewhat confused. |
See also #9611 |
I was looking at supporting Categorical objects in statsmodels [1] recently and again now in seaborn [2]. I think it would be great to support both ordered and unordered Categorical variables (via a flag or something). IIRC, there was some discussion about this. Is it possible? From what I could tell all Categorical types are assumed to have some order.
[1] statsmodels/statsmodels#2133
[2] mwaskom/seaborn#361