BUG: don't sort unique values from categoricals #9331

shoyer · 2015-01-22T02:36:03Z

This should resolve the inconsistency @mwaskom reported in #9148.

CC @jreback @TomAugspurger @JanSchulz

jreback · 2015-01-22T02:39:10Z

this makes sense

mwaskom · 2015-01-22T02:44:03Z

👍 thanks @shoyer

jankatins · 2015-01-22T07:44:19Z

pandas/tests/test_categorical.py

        cat = Categorical(["a","b","a", np.nan], categories=["a","b","c"])
        res = cat.unique()
        exp = np.asarray(["a","b", np.nan], dtype=object)
        self.assert_numpy_array_equal(res, exp)

+        cat = Categorical(["b", "a"], categories=["a", "b"], ordered=False)


A comment would be nice for the next coder :-)

shoyer · 2015-01-22T08:21:28Z

Actually, does it even make sense to sort the unique values for ordered Categoricals? .unique() does not sort for any other dtype, and it's easy enough to chain .unique().order() if desired.

jorisvandenbossche · 2015-01-22T08:53:58Z

@shoyer I just wanted to ask a similar question: what if you want the unique values in order of appearance when you have an ordered categorical? That is now rather impractical to obtain.

And another, thing: I don't think that .unique().order() is possible to obtain the unique ordered categories? As .unique already returns a plain ndarray, and this does not know anymore about the sort order of the categorical.

shoyer · 2015-01-22T08:57:14Z

@jorisvandenbossche Maybe .unique() just should return another Categorical object then? It's straightforward enough to coerce to ndarray when desired via np.asarray.

jorisvandenbossche · 2015-01-22T08:59:17Z

I think we had that discussion before if I recall correctly, and the we decided to change it from categorical to ndarray ... (but I should look it up)

Update: maybe I was wrong, I don't directly find it. There was a change in unique letting return only the "used" categories: #8937
Update2: ok, the discussion was about that it first returned an Index, and that was changed to an array.

jankatins · 2015-01-22T10:23:29Z

I don't think .unique() should return anything else than what the rest of pandas returns so that there are as few surprises and needs for special casing Series/cols of categorical values.

[EDIT: on the other hand R returns a factor for unique(factor(...)): https://github.com//issues/8559#issuecomment-59416072]

jankatins · 2015-01-22T10:41:30Z

pandas/core/categorical.py

+        from pandas.core.nanops import unique1d
+        # unlike np.unique, unique1d does not sort
+        unique_codes = unique1d(self.codes)
+        if self.ordered:


The more I think about this (and read the bugreports which have problems with this), the more I disagree with this change: if unique returns values in the order in which they appear in the rows, then this "contract" should be used. "order" only says that values in the categorical can be ordered, not that unique switches sorts the values. If one wants to have the categories in order, then take the categories directly (e.g. mwaskom/seaborn#361)

The only think missing is the "get me all unique values in the same order as specified by the categories", and one could give that a new keyword or a new method

jorisvandenbossche · 2015-01-22T10:48:17Z

@JanSchulz for that last thing you mention (get me all unique values in the same order as specified by the categories), you can do:

In [44]: s = pd.Series(pd.Categorical(list('acbab'), list('abcde')))

In [45]: s
Out[45]:
0    a
1    c
2    b
3    a
4    b
dtype: category
Categories (5, object): [a < b < c < d < e]

In [46]: s.unique()
Out[46]: array(['a', 'b', 'c'], dtype=object)  # <----- this should then be [a, c, b] 

In [49]: s.cat.remove_unused_categories().cat.categories
Out[49]: Index([u'a', u'b', u'c'], dtype='object')   # <----- this stays as it is

jankatins · 2015-01-22T10:51:02Z

The last step is probably costly, as remove_unused_categories() does a complete redo of the factorization :-(

jreback · 2015-01-22T10:52:53Z

Here is a cheap way of doing the last step (ignoring that you need to take out -1 for nan and then re-add them)

In [25]: s.cat.categories.take(unique1d(s.cat.codes))
Out[25]: Index([u'a', u'c', u'b'], dtype='object')

In fact I think unique should just do this. It will handle ordered and unordered the same. Namely the same order as the categories are in (for non-ordered this is incidental, but makes sense).

In the actual Categorical

In [27]: c = s.values

In [28]: c.categories.take(unique1d(c.codes))
Out[28]: Index([u'a', u'c', u'b'], dtype='object')

jankatins · 2015-01-22T12:09:11Z

@jreback: nope, that's the sorting order as defined on the rows ("by apearance"), not the categories. Yours should be the implementation of normal unique().

The last step above would be s.cat.categories.take(np.unique(s.cat.codes)), as that would be ordered in the same way as categories.

So basically:

def unique(sort=False)
   """ ...
   sort: sort the unique values by the order specified in `categories`
   """
    from pandas.core.nanops import unique1d
    # unlike np.unique, unique1d and pandas does not sort by default
    unique_codes = unique1d(self.codes)
    if sort:
        unique_codes = np.sort(unique_codes)
        # for compatibility with normal unique, which has nan last
        if unique_codes[0] == -1:
            unique_codes[0:-1] = unique_codes[1:]
            unique_codes[-1] = -1

shoyer · 2015-01-22T18:31:33Z

OK, I think we are in agreement that unique should not sort by default. I don't think that should be an option, either -- that's not found in any other unique methods, and it's better to support composability than to add flags.

Either it should return a Categorical (which is fine) or it can return an ndarray (like it currently does). In the later case, it's easy to sort by turning things back into a categorical if desired: Categorical(c.unique(), c.categories).order(). Given the ease of doing this and the advantages of uniform signatures, I suppose we should probably stick with ndarray.

TomAugspurger · 2015-01-22T19:08:38Z

+1 for no sort option and +1 for returning an ndarray.

On Thu, Jan 22, 2015 at 12:31 PM, Stephan Hoyer [email protected]
wrote:

OK, I think we are in agreement that unique should not sort by default. I
don't think that should be an option, either -- that's not found in any
other unique methods, and it's better to support composability than to add
flags.

Either it should return a Categorical (which is fine) or it can return an
ndarray (like it currently does). In the later case, it's easy to sort by
turning things back into a categorical if desired: Categorical(c.unique(),
c.categories).order(). Given the ease of doing this and the advantages of
uniform signatures, I suppose we should probably stick with ndarray.

—
Reply to this email directly or view it on GitHub
#9331 (comment).

jorisvandenbossche · 2015-01-22T20:06:04Z

Also +1 for no sorting at all (ordered or unordered)

For the return value: Series.values also returns a Categorical instead of an array, so in that regard it would be somewhat consistent within the 'categorical series' to let Series.unique return a Categorical, but of course not consistent within unique

shoyer · 2015-01-23T06:58:54Z

OK, I went for the simple route of making unique never sort and return an ndarray.

In principle, returning a categorical from unique is the right idea -- that would be the right decision if categorical was a real numpy dtype. But with categorical being mostly but not fully ndarray-like, it just causes more hassle for now.

PS @stevesimmons you should setup your editor to trim trailing whitespace. It's the source of some spurious changes in this PR. Not a big deal, just something to keep in mind for next time.

jreback · 2015-01-23T07:04:35Z

pandas/core/categorical.py

-        if unique_codes[0] == -1:
-            unique_codes[0:-1] = unique_codes[1:]
-            unique_codes[-1] = -1
+        from pandas.core.nanops import unique1d


still need the nan insertion otherwise the -1 code would have meaning

@jreback This is fed into take_1d which fills -1 with fill_value... which is happily exactly what we want here to handle NaN. That behavior is unchanged from before (and still tested). So I think this is OK?

ahh yes that's right
ok then

Afaik unique sorts nan last...

@JanSchulz Nope, unique1d does not sort NaN last. I modified the test involving NaNs to make sure.

ok, s.unique() also not. Seems that sorting an nan handling is only done in numpy...

s = pd.Series([1,2,3,4,5,np.nan,6,1,2,3,4]) s.unique() array([ 1., 2., 3., 4., 5., nan, 6.])

Sorry for the noise...

shoyer · 2015-01-23T21:24:12Z

Just realized we don't actually guarantee ordering for unique values anywhere in general. Made a new issue about that: #9346.

This should resolve the inconsistency mwaskom reported in GH9148. CC jreback TomAugspurger JanSchulz

shoyer · 2015-02-12T04:09:32Z

OK, I updated the change note based on @JanSchulz's suggestion.

Any other comments before I merge this? I am inclined to stay with ndarray output for this change, though that is by no means a final decision -- we could change this to Categorical in the future if someone makes a case for that.

jreback · 2015-02-12T12:27:06Z

yep this is consistent with the rest of pandas so lgtm

jorisvandenbossche · 2015-02-12T20:51:41Z

ok for me!
(only small side note that your formulation in the docstring does depend on the outcome of #9346)

BUG: don't sort unique values from categoricals

shoyer · 2015-02-13T00:42:40Z

Merged.

@jorisvandenbossche agreed, may need to revisit this following #9346 (which will hopefully be before 0.16)

shoyer force-pushed the categorical-unique-order branch from fbe5879 to ce16f54 Compare January 22, 2015 02:36

shoyer added Categorical Categorical Data Type Enhancement labels Jan 22, 2015

shoyer added this to the 0.16.0 milestone Jan 22, 2015

jankatins reviewed Jan 22, 2015
View reviewed changes

shoyer force-pushed the categorical-unique-order branch from ce16f54 to ac7bc7e Compare January 22, 2015 07:59

jankatins reviewed Jan 22, 2015
View reviewed changes

shoyer force-pushed the categorical-unique-order branch from ac7bc7e to 7b66059 Compare January 23, 2015 06:51

shoyer changed the title ~~ENH: don't sort unique values from unordered categoricals~~ BUG: don't sort unique values from categoricals Jan 23, 2015

shoyer added Bug and removed Enhancement labels Jan 23, 2015

jreback reviewed Jan 23, 2015
View reviewed changes

shoyer mentioned this pull request Jan 25, 2015

DOC/TST: is pd.unique and the order it returns API? #9346

Closed

BUG: don't sort unique values from categoricals

b787bf8

This should resolve the inconsistency mwaskom reported in GH9148. CC jreback TomAugspurger JanSchulz

shoyer force-pushed the categorical-unique-order branch from 7b66059 to b787bf8 Compare February 12, 2015 03:33

shoyer added a commit that referenced this pull request Feb 13, 2015

Merge pull request #9331 from shoyer/categorical-unique-order

e266c3d

BUG: don't sort unique values from categoricals

shoyer merged commit e266c3d into pandas-dev:master Feb 13, 2015

shoyer deleted the categorical-unique-order branch February 13, 2015 00:42

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: don't sort unique values from categoricals #9331

BUG: don't sort unique values from categoricals #9331

shoyer commented Jan 22, 2015

jreback commented Jan 22, 2015

mwaskom commented Jan 22, 2015

jankatins Jan 22, 2015

shoyer commented Jan 22, 2015

jorisvandenbossche commented Jan 22, 2015

shoyer commented Jan 22, 2015

jorisvandenbossche commented Jan 22, 2015

jankatins commented Jan 22, 2015

jankatins Jan 22, 2015

jorisvandenbossche commented Jan 22, 2015

jankatins commented Jan 22, 2015

jreback commented Jan 22, 2015

jankatins commented Jan 22, 2015

shoyer commented Jan 22, 2015

TomAugspurger commented Jan 22, 2015

jorisvandenbossche commented Jan 22, 2015

shoyer commented Jan 23, 2015

jreback Jan 23, 2015

shoyer Jan 23, 2015

jreback Jan 23, 2015

jankatins Jan 23, 2015

shoyer Jan 23, 2015

jankatins Jan 23, 2015

shoyer commented Jan 23, 2015

shoyer commented Feb 12, 2015

jreback commented Feb 12, 2015

jorisvandenbossche commented Feb 12, 2015

shoyer commented Feb 13, 2015

BUG: don't sort unique values from categoricals #9331

BUG: don't sort unique values from categoricals #9331

Conversation

shoyer commented Jan 22, 2015

jreback commented Jan 22, 2015

mwaskom commented Jan 22, 2015

Choose a reason for hiding this comment

shoyer commented Jan 22, 2015

jorisvandenbossche commented Jan 22, 2015

shoyer commented Jan 22, 2015

jorisvandenbossche commented Jan 22, 2015

jankatins commented Jan 22, 2015

Choose a reason for hiding this comment

jorisvandenbossche commented Jan 22, 2015

jankatins commented Jan 22, 2015

jreback commented Jan 22, 2015

jankatins commented Jan 22, 2015

shoyer commented Jan 22, 2015

TomAugspurger commented Jan 22, 2015

jorisvandenbossche commented Jan 22, 2015

shoyer commented Jan 23, 2015

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

shoyer commented Jan 23, 2015

shoyer commented Feb 12, 2015

jreback commented Feb 12, 2015

jorisvandenbossche commented Feb 12, 2015

shoyer commented Feb 13, 2015