BUG: don't sort unique values from categoricals #9331

Merged: 1 commit, Feb 13, 2015
2 changes: 2 additions & 0 deletions doc/source/whatsnew/v0.16.0.txt
@@ -193,6 +193,8 @@ Bug Fixes
 SQLAlchemy type (:issue:`9083`).


+- Items in ``Categorical.unique()`` (and ``s.unique()`` if ``s`` is of dtype ``category``) now appear in the order in which they are originally found, not in sorted order (:issue:`9331`). This is now consistent with the behavior for other dtypes in pandas.
+

 - Fixed bug on big endian platforms which produced incorrect results in ``StataReader`` (:issue:`8688`).
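
The ordering change described in the ``Categorical.unique()`` entry can be sketched on the integer codes alone. This is a pure-Python illustration, not pandas code: both helper names are hypothetical, and it assumes the usual convention that code ``-1`` marks a missing value.

```python
def unique_codes_sorted_nan_last(codes):
    """Sketch of the old behaviour: sorted unique codes, with -1 moved last."""
    unique = sorted(set(codes))
    if unique and unique[0] == -1:
        unique = unique[1:] + [-1]
    return unique


def unique_codes_in_order(codes):
    """Sketch of the new behaviour: unique codes in order of first appearance."""
    seen = set()
    ordered = []
    for code in codes:
        if code not in seen:
            seen.add(code)
            ordered.append(code)
    return ordered


# Codes for ["b", "b", NaN, "a"] with categories ["a", "b", "c"]; -1 marks NaN.
codes = [1, 1, -1, 0]
old_order = unique_codes_sorted_nan_last(codes)
new_order = unique_codes_in_order(codes)
```

Mapped back through the categories, the first ordering reads a, b, NaN and the second reads b, NaN, a.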

11 changes: 5 additions & 6 deletions pandas/core/categorical.py
@@ -1386,17 +1386,16 @@ def unique(self):
         """
         Return the unique values.

-        Unused categories are NOT returned.
+        Unused categories are NOT returned. Unique values are returned in order
+        of appearance.

         Returns
         -------
         unique values : array
         """
-        unique_codes = np.unique(self.codes)
-        # for compatibility with normal unique, which has nan last
-        if unique_codes[0] == -1:
-            unique_codes[0:-1] = unique_codes[1:]
-            unique_codes[-1] = -1
+        from pandas.core.nanops import unique1d
Contributor

Still need the NaN insertion; otherwise the -1 code would have meaning.

Member Author

@jreback This is fed into take_1d which fills -1 with fill_value... which is happily exactly what we want here to handle NaN. That behavior is unchanged from before (and still tested). So I think this is OK?
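
A minimal sketch of the fill behaviour described in this comment, in plain Python rather than pandas. The helper `take_with_fill` is hypothetical and only mimics the idea that an index of -1 yields the fill value instead of wrapping around to the last element; it is not the `take_1d` implementation.

```python
import math


def take_with_fill(values, indexer, fill_value=float("nan")):
    """Pick values[i] for each i in indexer; an index of -1 yields fill_value."""
    return [fill_value if i == -1 else values[i] for i in indexer]


categories = ["a", "b", "c"]
# Unique codes in order of appearance for ["b", "b", NaN, "a"].
result = take_with_fill(categories, [1, -1, 0])
```

Under these assumptions the -1 slot comes back as NaN rather than as the last category, which is why no separate NaN insertion step is needed.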

Contributor

ahh yes that's right
ok then

Contributor

Afaik unique sorts nan last...

Member Author

@JanSchulz Nope, unique1d does not sort NaN last. I modified the test involving NaNs to make sure.

Contributor

OK, s.unique() doesn't either. It seems that sorting and NaN handling are only done in numpy...

>>> import numpy as np
>>> import pandas as pd
>>> s = pd.Series([1, 2, 3, 4, 5, np.nan, 6, 1, 2, 3, 4])
>>> s.unique()
array([  1.,   2.,   3.,   4.,   5.,  nan,   6.])

Sorry for the noise...
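
For readers without pandas at hand, the order-of-appearance behaviour discussed in this thread can be approximated in plain Python. This is only a sketch: `unique_in_order` is a hypothetical helper, and the explicit NaN check is an assumption needed because NaN never compares equal to itself.

```python
import math


def unique_in_order(values):
    """Return unique values in order of first appearance, keeping one NaN."""
    seen = set()
    seen_nan = False
    ordered = []
    for value in values:
        if isinstance(value, float) and math.isnan(value):
            # NaN != NaN, so track it separately from the set of seen values.
            if not seen_nan:
                seen_nan = True
                ordered.append(value)
        elif value not in seen:
            seen.add(value)
            ordered.append(value)
    return ordered


nan = float("nan")
result = unique_in_order([1, 2, 3, 4, 5, nan, 6, 1, 2, 3, 4])
```

The NaN stays in sixth position, where it first appears, instead of being moved to the end.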

+        # unlike np.unique, unique1d does not sort
+        unique_codes = unique1d(self.codes)
         return take_1d(self.categories.values, unique_codes)

     def equals(self, other):
7 changes: 5 additions & 2 deletions pandas/tests/test_categorical.py
@@ -774,12 +774,15 @@ def test_unique(self):
         exp = np.asarray(["a","b"])
         res = cat.unique()
         self.assert_numpy_array_equal(res, exp)
+
         cat = Categorical(["a","b","a","a"], categories=["a","b","c"])
         res = cat.unique()
         self.assert_numpy_array_equal(res, exp)
-        cat = Categorical(["a","b","a", np.nan], categories=["a","b","c"])
+
+        # unique should not sort
+        cat = Categorical(["b", "b", np.nan, "a"], categories=["a","b","c"])
         res = cat.unique()
-        exp = np.asarray(["a","b", np.nan], dtype=object)
+        exp = np.asarray(["b", np.nan, "a"], dtype=object)
         self.assert_numpy_array_equal(res, exp)

     def test_mode(self):