Ordered vs. Unordered Categoricals #9148

Closed
jseabold opened this issue Dec 24, 2014 · 23 comments · Fixed by #9622
Labels
Categorical Categorical Data Type Usage Question
Milestone

Comments

@jseabold
Contributor

I was looking at supporting Categorical objects in statsmodels [1] recently and again now in seaborn [2]. I think it would be great to support both ordered and unordered Categorical variables (via a flag or something). IIRC, there was some discussion about this. Is it possible? From what I could tell all Categorical types are assumed to have some order.

[1] statsmodels/statsmodels#2133
[2] mwaskom/seaborn#361

@jreback
Contributor

jreback commented Dec 24, 2014

these are directly supported via Categorical.ordered

see a tutorial here:

http://nbviewer.ipython.org/github/jreback/pydata2014-pandas/blob/master/notebooks/Categorical.ipynb
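For readers following along, here is a minimal sketch of the flag mentioned above. The level names are made up for illustration, and the snippet assumes a reasonably recent pandas rather than the version current when this thread was written.

```python
import pandas as pd

# Illustrative data; the level names are assumptions, not from the thread.
values = ["low", "high", "medium", "low"]
levels = ["low", "medium", "high"]

unordered = pd.Categorical(values, categories=levels, ordered=False)
ordered = pd.Categorical(values, categories=levels, ordered=True)

# The flag is exposed as the .ordered attribute.
print(unordered.ordered)
print(ordered.ordered)

# Order-dependent reductions such as max() are meaningful for the ordered one.
print(ordered.max())
```

With the ordered categorical, max() follows the declared level order rather than alphabetical order.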

@jreback jreback added Categorical Categorical Data Type Usage Question labels Dec 24, 2014
@jseabold
Contributor Author

Ah, nice. Thanks!

@jseabold
Contributor Author

What's the preferred idiom for checking for these new first-class citizen Categoricals? Checking for the cat attribute (i.e., hasattr(series, 'cat')) raises an error if it's not a Categorical. AFAICT, the old Categorical and the new Categorical type aren't compatible. So I guess that leaves a version check?

@jreback
Contributor

jreback commented Dec 26, 2014

see also the fuller discussion in #8814, as this needs a section in the docs

In [10]: df = DataFrame({'A' : list('abc'),'B':Series(list('def'),dtype='category')})

In [11]: df
Out[11]: 
   A  B
0  a  d
1  b  e
2  c  f

In [12]: df.dtypes
Out[12]: 
A      object
B    category
dtype: object

In [13]: df.select_dtypes(include=['category'])
Out[13]: 
   B
0  d
1  e
2  f

In [18]: pandas.core.common.is_categorical_dtype(df['B'])
Out[18]: True

In [19]: pandas.core.common.is_categorical_dtype(df['A'])
Out[19]: False

In [20]: df['A'].dtype.name
Out[20]: 'object'

In [21]: df['B'].dtype.name
Out[21]: 'category'
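The session above uses the import path available at the time. As a hedged sketch for newer pandas versions, the same check can be written against the dtype object itself; the helper name is_categorical is hypothetical and not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({
    "A": list("abc"),
    "B": pd.Series(list("def"), dtype="category"),
})

def is_categorical(series):
    """Hypothetical helper: True if the Series has a categorical dtype."""
    return isinstance(series.dtype, pd.CategoricalDtype)

print(is_categorical(df["A"]))
print(is_categorical(df["B"]))

# select_dtypes still works for picking out categorical columns.
print(list(df.select_dtypes(include=["category"]).columns))
```

Checking the dtype avoids touching the .cat accessor, which is only available on categorical Series.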

@mwaskom
Contributor

mwaskom commented Jan 21, 2015

Not sure if this is the right place for this, but the behavior of an unordered categorical surprised me a bit:

import string
import numpy as np
import pandas as pd

cat = pd.Categorical(list(string.uppercase), ordered=False)
num = np.random.randn(26)
df = pd.DataFrame(dict(cat=cat, num=num))

df.sort("num").cat.unique()

This returns

array(['A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M',
       'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z'], dtype=object)

I was expecting to get the order of letters that now appears in the DataFrame (which would happen if cat just had a regular object dtype), rather than the original alphabetical order.

This is relevant because I want to leverage Categorical objects in seaborn to get plot elements ordered in good ways. I was hoping that Categorical.unique() would give me the category order, if it exists, and the item order in the DataFrame if not.

Is this the correct behavior? (Probably). Is there a way to get what I want out of the object?

@jreback
Contributor

jreback commented Jan 21, 2015

In [24]: np.array(df.sort('num')['cat'].values)
Out[24]:
array(['K', 'J', 'X', 'P', 'Q', 'U', 'O', 'L', 'H', 'N', 'T', 'R', 'A',
       'S', 'E', 'W', 'M', 'G', 'B', 'V', 'Z', 'Y', 'I', 'C', 'F', 'D'], dtype=object)

@mwaskom
Contributor

mwaskom commented Jan 21, 2015

That gives me the array of values, but I want the array of unique values, and if I use np.unique() it will sort them.

@mwaskom
Contributor

mwaskom commented Jan 21, 2015

Also I'd have to handle the "is this column a categorical dtype series, and if so, is it already ordered?" logic on my end, where it would be preferable to get a consistent answer from the series itself.
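The "logic on my end" described here could look roughly like the sketch below. The function name category_order is hypothetical, the data is made up, and the snippet assumes a recent pandas (sort_values, pd.CategoricalDtype); it is one possible reading of the request, not what pandas or seaborn actually do.

```python
import pandas as pd

def category_order(series):
    """Hypothetical helper: category order if ordered, else appearance order."""
    dtype = series.dtype
    if isinstance(dtype, pd.CategoricalDtype) and dtype.ordered:
        # Ordered categorical: use the declared category order
        # (this includes categories that never occur in the data).
        return list(dtype.categories)
    # Anything else: unique values in the order they appear.
    return list(pd.unique(series.astype(object)))

df = pd.DataFrame({
    "cat": pd.Categorical(list("ABCD"), ordered=False),
    "num": [3.0, 1.0, 4.0, 2.0],
})

print(category_order(df.sort_values("num")["cat"]))
```

Converting to object before calling pd.unique sidesteps any categorical-specific behavior, so the result only depends on row order.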

@jreback
Contributor

jreback commented Jan 21, 2015

unique by definition is not ordered (and arbitrary)

@mwaskom
Contributor

mwaskom commented Jan 21, 2015

When I use an ordered category object, it is:

cat = pd.Categorical(["b", "c", "c", "a", "b", "c"], categories=["c", "b", "a"])
cat.unique()

gives

array(['c', 'b', 'a'], dtype=object)

@jreback
Contributor

jreback commented Jan 21, 2015

In [34]: c = df['cat'].values

In [35]: c.reorder_categories(c.take(df.sort('num').index.unique())).categories
Out[35]: Index([u'C', u'U', u'H', u'A', u'O', u'E', u'X', u'F', u'D', u'S', u'L', u'Y', u'M', u'V', u'G', u'B', u'N', u'R', u'W', u'Q', u'I', u'J', u'K', u'T', u'P', u'Z'], dtype='object')

@mwaskom
Contributor

mwaskom commented Jan 21, 2015

From the docstring:

ordered : boolean, optional
    Whether or not this categorical is treated as a ordered categorical. If not given,
    the resulting categorical will be ordered if values can be sorted.

This caused me to think that setting ordered to False would not lexicographically sort the levels, but would instead fall back to the normal unique() behavior of giving me the values in the order they appear in the dataframe.

@TomAugspurger
Contributor

Maybe

df.groupby('cat').num.first().order().index
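Series.order() was later removed from pandas. A rough equivalent of the one-liner above for newer versions, using a small made-up frame, might look like this; observed=True is passed explicitly on the assumption that only categories present in the data are wanted.

```python
import pandas as pd

df = pd.DataFrame({
    "cat": pd.Categorical(list("ABCD"), ordered=False),
    "num": [3.0, 1.0, 4.0, 2.0],
})

# First "num" per category, sorted by that value; the index is the category order.
order = df.groupby("cat", observed=True)["num"].first().sort_values().index
print(list(order))
```

This takes the first num value seen for each category and sorts the categories by it, so with repeated categories the result depends on which row comes first.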

@jreback
Contributor

jreback commented Jan 21, 2015

actually @TomAugspurger soln prob the best.

You are uniquifying on one column, but ordering by another.

@TomAugspurger
Contributor

Do we specify the order of values from Series.unique anywhere? I thought it was arbitrary, but I could be wrong.

@jreback
Contributor

jreback commented Jan 21, 2015

no, unique is by definition un-ordered. The ordering is undefined.

@jreback
Contributor

jreback commented Jan 21, 2015

An ordered Categorical simply has its sort order defined differently, namely by the order of the categories given when the categorical is constructed. This is the purpose of reorder_categories, i.e. to construct a new ordering (which you give it).
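A small sketch of that point, reusing the values from the earlier example in this thread and assuming a recent pandas (sort_values, the .cat accessor):

```python
import pandas as pd

s = pd.Series(pd.Categorical(
    ["b", "c", "c", "a", "b", "c"],
    categories=["c", "b", "a"],
    ordered=True,
))

# Sorting follows the category order given at construction: c, then b, then a.
print(s.sort_values().tolist())

# reorder_categories supplies a new ordering over the same set of categories.
s2 = s.cat.reorder_categories(["a", "b", "c"], ordered=True)
print(s2.sort_values().tolist())
```

The values themselves are unchanged by reorder_categories; only the order used for sorting and comparisons differs.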

@mwaskom
Contributor

mwaskom commented Jan 21, 2015

OK so unique is not defined but is there any reason why unordered categoricals do something completely different (return the lexicographically sorted unique values) from all other dtypes (return the unique values in the order they appear in the dataframe)?

None of the proposed approaches are at all helpful for my usecase, which I don't think is that crazy.

@TomAugspurger
Contributor

Agreed that your usecase is not crazy. My solution is a bit hacky.

Looking at the code, we sort by category codes with np.unique, which is sorted. I wonder why we did that instead of our unique.
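To illustrate the difference being described, here is a hedged sketch that applies both functions to the integer codes of a small categorical. It is not the pandas internals, only a demonstration of why np.unique yields category order while pd.unique yields appearance order.

```python
import numpy as np
import pandas as pd

cat = pd.Categorical(["b", "c", "c", "a", "b", "c"], categories=["a", "b", "c"])

# Each value is stored as an integer code into cat.categories.
codes = cat.codes

# np.unique returns sorted codes, so the categories come back in category order.
via_numpy = list(cat.categories.take(np.unique(codes)))

# pd.unique keeps the order of first appearance.
via_pandas = list(cat.categories.take(pd.unique(codes)))

print(via_numpy)
print(via_pandas)
```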

@mwaskom
Contributor

mwaskom commented Jan 21, 2015

Here's a more concrete example of what I'm talking about: http://nbviewer.ipython.org/gist/mwaskom/7b464a7217858cbdcedd

The basic idea is that the sorting of the DataFrame is external to seaborn, but the determination of the order in which we should plot the grouping variable is internal. I want to be able to say "use Categorical wherever possible" because it gets you 1) automatic detection of what orientation the plot should be drawn in and 2) the order you expect when the categorical is ordered. But that won't work here, when we want the order of categories to change as the dataframe gets sorted by different numeric variables. This leads to confusing information about when to use category vs object.

@shoyer
Member

shoyer commented Jan 21, 2015

Almost certainly, we should be using pd.unique instead of np.unique internally here.

@mwaskom pd.unique will respect insertion order, but unfortunately does not handle ordered categoricals right. But as a workaround you could use it for unordered categoricals (for now).

@shoyer
Member

shoyer commented Jan 22, 2015

I'm going to reopen this issue until we clarify what the solution here should look like.

My earlier comment was somewhat confused.

@jorisvandenbossche
Member

See also #9611

6 participants