DOC/TST: add pd.unique doc-string & buggy return of Categorical #15939

jreback · 2017-04-07T13:21:42Z

jreback · 2017-04-07T13:22:19Z

I also added the already implicit guarantees of pd.unique (and some tests, though this is used everywhere). xref #9346

cc @jorisvandenbossche @shoyer

jorisvandenbossche · 2017-04-07T13:29:21Z

+1 on finally just adding the order of appearance to the docstring

Regarding the categorical unique, we have had quite some discussions about that before. Do you recall the outcome of that? (whether it should return an array or not)

jreback · 2017-04-07T13:31:01Z

This is from 0.19.2. So we already do this. The bug is that pd.unique didn't respect it.

In [2]: pd.Categorical(list('aabc')).unique()
Out[2]: 
[a, b, c]
Categories (3, object): [a, b, c]

In [3]: pd.unique(pd.Categorical(list('aabc')).unique())
Out[3]: array(['a', 'b', 'c'], dtype=object)

jorisvandenbossche · 2017-04-07T13:45:24Z

This is from 0.19.2. So we already do this.

OK, then I misremember that (or maybe I remember the discussion about the datetime tz case ..).
Then it indeed makes some sense to be consistent there. It should probably then a plain Categorical instead of a Series, though, to be consistent.

Something else, the datetime with tz example are not correct I think, I get also an array for them.

jreback · 2017-04-07T14:10:53Z

Ignore this one. Slightly wrong.

0.19.2

In [5]: Series(pd.Index([pd.Timestamp('20160101', tz='US/Eastern'),
   ...:           pd.Timestamp('20160101', tz='US/Eastern')])).unique()
Out[5]: array([Timestamp('2016-01-01 00:00:00-0500', tz='US/Eastern')], dtype=object)

In [6]: pd.Index([pd.Timestamp('20160101', tz='US/Eastern'),
   ...:           pd.Timestamp('20160101', tz='US/Eastern')]).unique()
Out[6]: DatetimeIndex(['2016-01-01 00:00:00-05:00'], dtype='datetime64[ns, US/Eastern]', freq=None)

In [7]: pd.unique(Series(pd.Index([pd.Timestamp('20160101', tz='US/Eastern'),
   ...:           pd.Timestamp('20160101', tz='US/Eastern')])))
Out[7]: array(['2016-01-01T05:00:00.000000000'], dtype='datetime64[ns]')

In [8]: pd.unique(pd.Index([pd.Timestamp('20160101', tz='US/Eastern'),
   ...:           pd.Timestamp('20160101', tz='US/Eastern')]))
Out[8]: array(['2016-01-01T05:00:00.000000000'], dtype='datetime64[ns]')

master

In [1]: Series(pd.Index([pd.Timestamp('20160101', tz='US/Eastern'),
   ...:    ...:           pd.Timestamp('20160101', tz='US/Eastern')])).unique()
Out[1]: array([Timestamp('2016-01-01 00:00:00-0500', tz='US/Eastern')], dtype=object)

In [2]: pd.Index([pd.Timestamp('20160101', tz='US/Eastern'),
   ...:    ...:           pd.Timestamp('20160101', tz='US/Eastern')]).unique()
   ...: 
Out[2]: DatetimeIndex(['2016-01-01 00:00:00-05:00'], dtype='datetime64[ns, US/Eastern]', freq=None)

In [3]: pd.unique(Series(pd.Index([pd.Timestamp('20160101', tz='US/Eastern'),
   ...:    ...:           pd.Timestamp('20160101', tz='US/Eastern')])))
   ...: 
Out[3]: DatetimeIndex(['2016-01-01 00:00:00-05:00'], dtype='datetime64[ns, US/Eastern]', freq=None)

In [4]: pd.unique(pd.Index([pd.Timestamp('20160101', tz='US/Eastern'),
   ...:    ...:           pd.Timestamp('20160101', tz='US/Eastern')]))
   ...: 
Out[4]: DatetimeIndex(['2016-01-01 00:00:00-05:00'], dtype='datetime64[ns, US/Eastern]', freq=None)

I think these were completely non-tested before. I think that pd.unique (before my cleanup), was doing the wrong things for datetimetz aware.

codecov · 2017-04-07T14:53:17Z

Codecov Report

Merging #15939 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #15939      +/-   ##
==========================================
+ Coverage   90.97%   90.97%   +<.01%     
==========================================
  Files         145      145              
  Lines       49475    49480       +5     
==========================================
+ Hits        45008    45013       +5     
  Misses       4467     4467

Flag	Coverage Δ
#multiple	`88.73% <100%> (ø)`	⬆️
#single	`40.67% <37.5%> (-0.01%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/algorithms.py	`94.57% <100%> (+0.04%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3b53202...df7d073. Read the comment docs.

codecov · 2017-04-07T14:53:27Z

Codecov Report

Merging #15939 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #15939      +/-   ##
==========================================
+ Coverage      91%      91%   +<.01%     
==========================================
  Files         145      145              
  Lines       49576    49581       +5     
==========================================
+ Hits        45118    45123       +5     
  Misses       4458     4458

Flag	Coverage Δ
#multiple	`88.77% <100%> (ø)`	⬆️
#single	`40.57% <44.44%> (-0.01%)`	⬇️

Impacted Files	Coverage Δ
pandas/core/base.py	`95.51% <ø> (ø)`	⬆️
pandas/core/categorical.py	`95.85% <ø> (ø)`	⬆️
pandas/core/series.py	`94.89% <100%> (ø)`	⬆️
pandas/core/algorithms.py	`94.57% <100%> (+0.04%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f35209e...dc81286. Read the comment docs.

jreback · 2017-04-07T14:54:15Z

So I fixed this to be consistent. Though .unique itself is NOT very consistent (but we have known this for a while).

Its now consistent between calling Series(....).unique() and pd.unique(Series).

jreback · 2017-04-07T20:31:55Z

@chris-b1 @jorisvandenbossche @shoyer

FYI, not trying to break new ground here at all, just make things consistent.

shoyer · 2017-04-07T21:51:22Z

pandas/core/algorithms.py

+    unique values.
+    - If the input is a Categorical dtype, the return is a Categorical
+    - If the input is an Index, the return is an Index
+    - If the input is a Series/ndarray, the return will be an ndarray


This is good for now. In the future (pandas 2), the return type should probably be the same as the input (i.e., Series -> Series).

agreed. I'll have to start making a list of these types of changes somewhere.

I think it should rather return an Array of the correct type instead of a Series (with a meaningless index in case of uniques), but that's a discussion for the other issue tracker :-) (and this consistently for all input object types / data types)

chris-b1 · 2017-04-07T23:15:24Z

Looks good, only question would on timezones, it looks you are taking the approach proposed in #15750 - returning an object array? I suppose that's the right answer.

jreback · 2017-04-07T23:20:46Z

chris that's a good point

it is somewhat orthogonal to this one though (.unique already returns an object array of the stamps)

jreback · 2017-04-08T13:38:55Z

@jorisvandenbossche if you'd have a look.

jorisvandenbossche

Mainly some minor doc comments. One bigger comment regarding the return type for a Categorical (Categorical vs categorical Series)

jorisvandenbossche · 2017-04-09T09:19:06Z

pandas/core/algorithms.py

    """
-    Hash table-based unique
+    Hash table-based unique. uniques are returned in order


uniques -> Uniques

jorisvandenbossche · 2017-04-09T09:22:02Z

doc/source/whatsnew/v0.20.0.txt

+  .. code-block:: ipython
+
+     # Series
+     In [5]: Series(pd.Index([pd.Timestamp('20160101', tz='US/Eastern'),


You don't need to 'Index' wrapped around the values (makes the line a little bit shorter)

also Series -> pd.Series

jorisvandenbossche · 2017-04-09T09:24:39Z

doc/source/whatsnew/v0.20.0.txt

+     [a, b, c]
+     Categories (3, object): [a, b, c]
+
+     In [2]: pd.unique(pd.Series(pd.Categorical(list('aabc'))).unique())


twice 'unique'

jorisvandenbossche · 2017-04-09T09:26:37Z

doc/source/whatsnew/v0.20.0.txt

+
+     # returns a Categorical
+     pd.Series(pd.Categorical(list('aabc'))).unique()
+     pd.unique(pd.Series(pd.Categorical(list('aabc'))).unique())


This one (without the .unique(), so only with pd.unique(..)) still returns something different -> a categorical Series instead of Categorical (like the one above)

these both return Categoricals (see comment above)

jorisvandenbossche · 2017-04-09T09:26:46Z

pandas/core/algorithms.py

+    Returns
+    -------
+    unique values.
+    - If the input is a Categorical dtype, the return is a Categorical


Those bullet points need one level of indentation (they are the 'description' part like in the parameters section)

jorisvandenbossche · 2017-04-09T09:28:19Z

pandas/core/algorithms.py

+
+        if isinstance(values, ABCSeries):
+            from pandas import Series
+            return Series(values.values.unique(), name=values.name)


As commented above, I think we shouldn't do the Series wrapping here

That would also be consistent with .unique(). With this branch:

In [13]: pd.unique(pd.Series(pd.Categorical(list('aabc')))) Out[13]: 0 a 1 b 2 c dtype: category Categories (3, object): [a, b, c] In [14]: pd.Series(pd.Categorical(list('aabc'))).unique() Out[14]: [a, b, c] Categories (3, object): [a, b, c]

yes fixed, so cat Series or Cat -> Cat, still CI -> CI.

jorisvandenbossche · 2017-04-09T09:52:47Z

One more thing that surprised me is that the unique of a Categorical changes the actual 'categories':

In [17]: pd.Series(pd.Categorical(list('aacb')))
Out[17]: 
0    a
1    a
2    c
3    b
dtype: category
Categories (3, object): [a, b, c]

In [18]: pd.Series(pd.Categorical(list('aacb'))).unique()
Out[18]: 
[a, c, b]
Categories (3, object): [a, c, b]   <----- not the original order, but taken the order of appearance

So the actual values have the order of appearance (like unique says to do), that is OK, but also the actual categories do the same, of which I don't really understand the reasoning.
Note this is the current behaviour, so not related to this PR, just noticed it reviewing this. Did we have a good rationale for this behaviour? (otherwise can open a new issue for it)

closes pandas-dev#9346

jreback · 2017-04-09T13:26:12Z

@jorisvandenbossche from your last

So the actual values have the order of appearance (like unique says to do), that is OK, but also the actual categories do the same, of which I don't really understand the reasoning.
Note this is the current behaviour, so not related to this PR, just noticed it reviewing this. Did we have a good rationale for this behaviour? (otherwise can open a new issue for it)

This is documented in Categorical.unique. Essentially you get the order of the codes for an unordered categorical (an ordered categorical returns the category order).

This then preserves the ordering of appearance (rather than the category ordering, I think to preserve the pd.unique guarantee)

e.g.

same in 0.19.2 and now.

In [1]: pd.Categorical(list('bac')).unique()
Out[1]: 
[b, a, c]
Categories (3, object): [b, a, c]

In [2]: pd.Categorical(list('bac'), categories=list('abc')).unique()
Out[2]: 
[b, a, c]
Categories (3, object): [b, a, c]

In [3]: pd.Categorical(list('bac'), ordered=True).unique()
Out[3]: 
[b, a, c]
Categories (3, object): [a < b < c]

In [4]: pd.Categorical(list('bac'), categories=list('abc'), ordered=True).unique()
Out[4]: 
[b, a, c]
Categories (3, object): [a < b < c]

I can see how this works. Not really sure about it though. We can open an issue to discuss for 0.21.0. (this is unchanged in this PR)

.unique of Categorical/Series now returns Categorical

jreback · 2017-04-09T13:40:47Z

@jorisvandenbossche updated if you'd have a quick relook. Updated docs & added some examples.

jorisvandenbossche · 2017-04-09T15:30:05Z

@jreback Looks good! Did some minor changes (pd -> pandas, I don't think sphinx is aware of the pd shortcut), and merged. Thanks!

jreback · 2017-04-09T15:50:18Z

thanks!

jreback · 2017-04-09T16:10:28Z

hmm links are not rendering in see also?

http://pandas-docs.github.io/pandas-docs-travis/generated/pandas.Series.unique.html?highlight=unique#pandas.Series.unique

TomAugspurger · 2017-04-09T17:13:34Z

Do we need to add Categorical.unique to https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/source/api.rst?

jreback · 2017-04-09T17:22:28Z

hmm yes i think that's why the links don't work
Series and pd.unique seem good

jreback added the Docs label Apr 7, 2017

jreback added this to the 0.20.0 milestone Apr 7, 2017

jreback force-pushed the unique branch from 1b7db7c to b9ca6dd Compare April 7, 2017 13:22

jreback force-pushed the unique branch 2 times, most recently from 774dbbc to f621ed3 Compare April 7, 2017 20:31

shoyer approved these changes Apr 7, 2017

View reviewed changes

jreback force-pushed the unique branch from f621ed3 to db5b0ae Compare April 7, 2017 22:25

jreback force-pushed the unique branch from db5b0ae to f4aff8a Compare April 8, 2017 13:38

jreback requested a review from jorisvandenbossche April 8, 2017 13:38

jorisvandenbossche reviewed Apr 9, 2017

View reviewed changes

jreback added 7 commits April 9, 2017 08:58

DOC/TST: add pd.unique doc-string & buggy return of Categorical

d3670d0

closes pandas-dev#9346

more tests

c5b2271

fix tz-aware unique

41c2d1b

docs

3e2ae12

tests for CI

e2b3867

TST: fix test on windows

c725c7f

doc fixes

a046dde

more doc fixes

220a86b

fix docs as per review

7d05b50

.unique of Categorical/Series now returns Categorical

jreback force-pushed the unique branch from f4aff8a to 7d05b50 Compare April 9, 2017 13:40

small doc fixes

dc81286

jorisvandenbossche merged commit c3c60f0 into pandas-dev:master Apr 9, 2017

jorisvandenbossche mentioned this pull request Nov 14, 2017

Any categorical dtype object's .unique() changes categories #18291

Closed

Uh oh!

DOC/TST: add pd.unique doc-string & buggy return of Categorical #15939

DOC/TST: add pd.unique doc-string & buggy return of Categorical #15939

Uh oh!

Conversation

jreback commented Apr 7, 2017

Uh oh!

jreback commented Apr 7, 2017

Uh oh!

jorisvandenbossche commented Apr 7, 2017

Uh oh!

jreback commented Apr 7, 2017

Uh oh!

jorisvandenbossche commented Apr 7, 2017

Uh oh!

jreback commented Apr 7, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov bot commented Apr 7, 2017

Codecov Report

Uh oh!

codecov bot commented Apr 7, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

jreback commented Apr 7, 2017

Uh oh!

jreback commented Apr 7, 2017

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

chris-b1 commented Apr 7, 2017

Uh oh!

jreback commented Apr 7, 2017

Uh oh!

jreback commented Apr 8, 2017

Uh oh!

jorisvandenbossche left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jorisvandenbossche commented Apr 9, 2017

Uh oh!

jreback commented Apr 9, 2017

Uh oh!

jreback commented Apr 9, 2017

Uh oh!

jorisvandenbossche commented Apr 9, 2017

Uh oh!

jreback commented Apr 9, 2017

Uh oh!

jreback commented Apr 9, 2017

Uh oh!

TomAugspurger commented Apr 9, 2017

Uh oh!

jreback commented Apr 7, 2017 •

edited

Loading

codecov bot commented Apr 7, 2017 •

edited

Loading