-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: don't sort unique values from categoricals #9331
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
fbe5879
to
ce16f54
Compare
this makes sense |
👍 thanks @shoyer |
cat = Categorical(["a","b","a", np.nan], categories=["a","b","c"]) | ||
res = cat.unique() | ||
exp = np.asarray(["a","b", np.nan], dtype=object) | ||
self.assert_numpy_array_equal(res, exp) | ||
|
||
cat = Categorical(["b", "a"], categories=["a", "b"], ordered=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A comment would be nice for the next coder :-)
ce16f54
to
ac7bc7e
Compare
Actually, does it even make sense to sort the unique values for ordered Categoricals? |
@shoyer I just wanted to ask a similar question: what if you want the unique values in order of appearance when you have an ordered categorical? That is now rather impractical to obtain. And another, thing: I don't think that |
@jorisvandenbossche Maybe |
I think we had that discussion before if I recall correctly, and the we decided to change it from categorical to ndarray ... (but I should look it up) Update: maybe I was wrong, I don't directly find it. There was a change in unique letting return only the "used" categories: #8937 |
I don't think [EDIT: on the other hand R returns a factor for |
from pandas.core.nanops import unique1d | ||
# unlike np.unique, unique1d does not sort | ||
unique_codes = unique1d(self.codes) | ||
if self.ordered: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The more I think about this (and read the bugreports which have problems with this), the more I disagree with this change: if unique returns values in the order in which they appear in the rows, then this "contract" should be used. "order" only says that values in the categorical can be ordered, not that unique switches sorts the values. If one wants to have the categories in order, then take the categories
directly (e.g. mwaskom/seaborn#361)
The only think missing is the "get me all unique values in the same order as specified by the categories", and one could give that a new keyword or a new method
@JanSchulz for that last thing you mention (get me all unique values in the same order as specified by the categories), you can do:
|
The last step is probably costly, as |
Here is a cheap way of doing the last step (ignoring that you need to take out -1 for nan and then re-add them)
In fact I think unique should just do this. It will handle ordered and unordered the same. Namely the same order as the categories are in (for non-ordered this is incidental, but makes sense). In the actual Categorical
|
@jreback: nope, that's the sorting order as defined on the rows ("by apearance"), not the categories. Yours should be the implementation of normal The last step above would be So basically:
|
OK, I think we are in agreement that unique should not sort by default. I don't think that should be an option, either -- that's not found in any other unique methods, and it's better to support composability than to add flags. Either it should return a Categorical (which is fine) or it can return an ndarray (like it currently does). In the later case, it's easy to sort by turning things back into a categorical if desired: |
+1 for no sort option and +1 for returning an ndarray. On Thu, Jan 22, 2015 at 12:31 PM, Stephan Hoyer [email protected]
|
Also +1 for no sorting at all (ordered or unordered) For the return value: |
ac7bc7e
to
7b66059
Compare
OK, I went for the simple route of making unique never sort and return an ndarray. In principle, returning a categorical from unique is the right idea -- that would be the right decision if categorical was a real numpy dtype. But with categorical being mostly but not fully ndarray-like, it just causes more hassle for now. PS @stevesimmons you should setup your editor to trim trailing whitespace. It's the source of some spurious changes in this PR. Not a big deal, just something to keep in mind for next time. |
if unique_codes[0] == -1: | ||
unique_codes[0:-1] = unique_codes[1:] | ||
unique_codes[-1] = -1 | ||
from pandas.core.nanops import unique1d |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
still need the nan insertion otherwise the -1 code would have meaning
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jreback This is fed into take_1d
which fills -1 with fill_value
... which is happily exactly what we want here to handle NaN. That behavior is unchanged from before (and still tested). So I think this is OK?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ahh yes that's right
ok then
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Afaik unique sorts nan last...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JanSchulz Nope, unique1d does not sort NaN last. I modified the test involving NaNs to make sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok, s.unique()
also not. Seems that sorting an nan handling is only done in numpy...
s = pd.Series([1,2,3,4,5,np.nan,6,1,2,3,4])
s.unique()
array([ 1., 2., 3., 4., 5., nan, 6.])
Sorry for the noise...
Just realized we don't actually guarantee ordering for unique values anywhere in general. Made a new issue about that: #9346. |
This should resolve the inconsistency mwaskom reported in GH9148. CC jreback TomAugspurger JanSchulz
7b66059
to
b787bf8
Compare
OK, I updated the change note based on @JanSchulz's suggestion. Any other comments before I merge this? I am inclined to stay with ndarray output for this change, though that is by no means a final decision -- we could change this to Categorical in the future if someone makes a case for that. |
yep this is consistent with the rest of pandas so lgtm |
ok for me! |
BUG: don't sort unique values from categoricals
Merged. @jorisvandenbossche agreed, may need to revisit this following #9346 (which will hopefully be before 0.16) |
This should resolve the inconsistency @mwaskom reported in #9148.
CC @jreback @TomAugspurger @JanSchulz