Skip to content

DOC/TST: add pd.unique doc-string & buggy return of Categorical #15939

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Apr 9, 2017

Conversation

jreback
Copy link
Contributor

@jreback jreback commented Apr 7, 2017

closes #9346

@jreback jreback added the Docs label Apr 7, 2017
@jreback jreback added this to the 0.20.0 milestone Apr 7, 2017
@jreback
Copy link
Contributor Author

jreback commented Apr 7, 2017

I also added the already implicit guarantees of pd.unique (and some tests, though this is used everywhere). xref #9346

cc @jorisvandenbossche @shoyer

@jorisvandenbossche
Copy link
Member

+1 on finally just adding the order of appearance to the docstring

Regarding the categorical unique, we have had quite some discussions about that before. Do you recall the outcome of that? (whether it should return an array or not)

@jreback
Copy link
Contributor Author

jreback commented Apr 7, 2017

This is from 0.19.2. So we already do this. The bug is that pd.unique didn't respect it.

In [2]: pd.Categorical(list('aabc')).unique()
Out[2]: 
[a, b, c]
Categories (3, object): [a, b, c]

In [3]: pd.unique(pd.Categorical(list('aabc')).unique())
Out[3]: array(['a', 'b', 'c'], dtype=object)

@jorisvandenbossche
Copy link
Member

This is from 0.19.2. So we already do this.

OK, then I misremember that (or maybe I remember the discussion about the datetime tz case ..).
Then it indeed makes some sense to be consistent there. It should probably then a plain Categorical instead of a Series, though, to be consistent.

Something else, the datetime with tz example are not correct I think, I get also an array for them.

@jreback
Copy link
Contributor Author

jreback commented Apr 7, 2017

Ignore this one. Slightly wrong.

0.19.2

In [5]: Series(pd.Index([pd.Timestamp('20160101', tz='US/Eastern'),
   ...:           pd.Timestamp('20160101', tz='US/Eastern')])).unique()
Out[5]: array([Timestamp('2016-01-01 00:00:00-0500', tz='US/Eastern')], dtype=object)

In [6]: pd.Index([pd.Timestamp('20160101', tz='US/Eastern'),
   ...:           pd.Timestamp('20160101', tz='US/Eastern')]).unique()
Out[6]: DatetimeIndex(['2016-01-01 00:00:00-05:00'], dtype='datetime64[ns, US/Eastern]', freq=None)

In [7]: pd.unique(Series(pd.Index([pd.Timestamp('20160101', tz='US/Eastern'),
   ...:           pd.Timestamp('20160101', tz='US/Eastern')])))
Out[7]: array(['2016-01-01T05:00:00.000000000'], dtype='datetime64[ns]')

In [8]: pd.unique(pd.Index([pd.Timestamp('20160101', tz='US/Eastern'),
   ...:           pd.Timestamp('20160101', tz='US/Eastern')]))
Out[8]: array(['2016-01-01T05:00:00.000000000'], dtype='datetime64[ns]')

master

In [1]: Series(pd.Index([pd.Timestamp('20160101', tz='US/Eastern'),
   ...:    ...:           pd.Timestamp('20160101', tz='US/Eastern')])).unique()
Out[1]: array([Timestamp('2016-01-01 00:00:00-0500', tz='US/Eastern')], dtype=object)

In [2]: pd.Index([pd.Timestamp('20160101', tz='US/Eastern'),
   ...:    ...:           pd.Timestamp('20160101', tz='US/Eastern')]).unique()
   ...: 
Out[2]: DatetimeIndex(['2016-01-01 00:00:00-05:00'], dtype='datetime64[ns, US/Eastern]', freq=None)

In [3]: pd.unique(Series(pd.Index([pd.Timestamp('20160101', tz='US/Eastern'),
   ...:    ...:           pd.Timestamp('20160101', tz='US/Eastern')])))
   ...: 
Out[3]: DatetimeIndex(['2016-01-01 00:00:00-05:00'], dtype='datetime64[ns, US/Eastern]', freq=None)

In [4]: pd.unique(pd.Index([pd.Timestamp('20160101', tz='US/Eastern'),
   ...:    ...:           pd.Timestamp('20160101', tz='US/Eastern')]))
   ...: 
Out[4]: DatetimeIndex(['2016-01-01 00:00:00-05:00'], dtype='datetime64[ns, US/Eastern]', freq=None)

I think these were completely non-tested before. I think that pd.unique (before my cleanup), was doing the wrong things for datetimetz aware.

@codecov
Copy link

codecov bot commented Apr 7, 2017

Codecov Report

Merging #15939 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #15939      +/-   ##
==========================================
+ Coverage   90.97%   90.97%   +<.01%     
==========================================
  Files         145      145              
  Lines       49475    49480       +5     
==========================================
+ Hits        45008    45013       +5     
  Misses       4467     4467
Flag Coverage Δ
#multiple 88.73% <100%> (ø) ⬆️
#single 40.67% <37.5%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/algorithms.py 94.57% <100%> (+0.04%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3b53202...df7d073. Read the comment docs.

@codecov
Copy link

codecov bot commented Apr 7, 2017

Codecov Report

Merging #15939 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #15939      +/-   ##
==========================================
+ Coverage      91%      91%   +<.01%     
==========================================
  Files         145      145              
  Lines       49576    49581       +5     
==========================================
+ Hits        45118    45123       +5     
  Misses       4458     4458
Flag Coverage Δ
#multiple 88.77% <100%> (ø) ⬆️
#single 40.57% <44.44%> (-0.01%) ⬇️
Impacted Files Coverage Δ
pandas/core/base.py 95.51% <ø> (ø) ⬆️
pandas/core/categorical.py 95.85% <ø> (ø) ⬆️
pandas/core/series.py 94.89% <100%> (ø) ⬆️
pandas/core/algorithms.py 94.57% <100%> (+0.04%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f35209e...dc81286. Read the comment docs.

@jreback
Copy link
Contributor Author

jreback commented Apr 7, 2017

So I fixed this to be consistent. Though .unique itself is NOT very consistent (but we have known this for a while).

Its now consistent between calling Series(....).unique() and pd.unique(Series).

@jreback jreback force-pushed the unique branch 2 times, most recently from 774dbbc to f621ed3 Compare April 7, 2017 20:31
@jreback
Copy link
Contributor Author

jreback commented Apr 7, 2017

@chris-b1 @jorisvandenbossche @shoyer

FYI, not trying to break new ground here at all, just make things consistent.

unique values.
- If the input is a Categorical dtype, the return is a Categorical
- If the input is an Index, the return is an Index
- If the input is a Series/ndarray, the return will be an ndarray
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good for now. In the future (pandas 2), the return type should probably be the same as the input (i.e., Series -> Series).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agreed. I'll have to start making a list of these types of changes somewhere.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it should rather return an Array of the correct type instead of a Series (with a meaningless index in case of uniques), but that's a discussion for the other issue tracker :-) (and this consistently for all input object types / data types)

@chris-b1
Copy link
Contributor

chris-b1 commented Apr 7, 2017

Looks good, only question would on timezones, it looks you are taking the approach proposed in #15750 - returning an object array? I suppose that's the right answer.

@jreback
Copy link
Contributor Author

jreback commented Apr 7, 2017

chris that's a good point

it is somewhat orthogonal to this one though (.unique already returns an object array of the stamps)

@jreback
Copy link
Contributor Author

jreback commented Apr 8, 2017

@jorisvandenbossche if you'd have a look.

Copy link
Member

@jorisvandenbossche jorisvandenbossche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mainly some minor doc comments. One bigger comment regarding the return type for a Categorical (Categorical vs categorical Series)

"""
Hash table-based unique
Hash table-based unique. uniques are returned in order
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uniques -> Uniques

.. code-block:: ipython

# Series
In [5]: Series(pd.Index([pd.Timestamp('20160101', tz='US/Eastern'),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to 'Index' wrapped around the values (makes the line a little bit shorter)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also Series -> pd.Series

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

[a, b, c]
Categories (3, object): [a, b, c]

In [2]: pd.unique(pd.Series(pd.Categorical(list('aabc'))).unique())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

twice 'unique'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


# returns a Categorical
pd.Series(pd.Categorical(list('aabc'))).unique()
pd.unique(pd.Series(pd.Categorical(list('aabc'))).unique())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This one (without the .unique(), so only with pd.unique(..)) still returns something different -> a categorical Series instead of Categorical (like the one above)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these both return Categoricals (see comment above)

Returns
-------
unique values.
- If the input is a Categorical dtype, the return is a Categorical
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those bullet points need one level of indentation (they are the 'description' part like in the parameters section)


if isinstance(values, ABCSeries):
from pandas import Series
return Series(values.values.unique(), name=values.name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As commented above, I think we shouldn't do the Series wrapping here

That would also be consistent with .unique(). With this branch:

In [13]: pd.unique(pd.Series(pd.Categorical(list('aabc'))))
Out[13]: 
0    a
1    b
2    c
dtype: category
Categories (3, object): [a, b, c]

In [14]: pd.Series(pd.Categorical(list('aabc'))).unique()
Out[14]: 
[a, b, c]
Categories (3, object): [a, b, c]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes fixed, so cat Series or Cat -> Cat, still CI -> CI.

@jorisvandenbossche
Copy link
Member

One more thing that surprised me is that the unique of a Categorical changes the actual 'categories':

In [17]: pd.Series(pd.Categorical(list('aacb')))
Out[17]: 
0    a
1    a
2    c
3    b
dtype: category
Categories (3, object): [a, b, c]

In [18]: pd.Series(pd.Categorical(list('aacb'))).unique()
Out[18]: 
[a, c, b]
Categories (3, object): [a, c, b]   <----- not the original order, but taken the order of appearance

So the actual values have the order of appearance (like unique says to do), that is OK, but also the actual categories do the same, of which I don't really understand the reasoning.
Note this is the current behaviour, so not related to this PR, just noticed it reviewing this. Did we have a good rationale for this behaviour? (otherwise can open a new issue for it)

@jreback
Copy link
Contributor Author

jreback commented Apr 9, 2017

@jorisvandenbossche from your last

So the actual values have the order of appearance (like unique says to do), that is OK, but also the actual categories do the same, of which I don't really understand the reasoning.
Note this is the current behaviour, so not related to this PR, just noticed it reviewing this. Did we have a good rationale for this behaviour? (otherwise can open a new issue for it)

This is documented in Categorical.unique. Essentially you get the order of the codes for an unordered categorical (an ordered categorical returns the category order).

This then preserves the ordering of appearance (rather than the category ordering, I think to preserve the pd.unique guarantee)

e.g.

same in 0.19.2 and now.

In [1]: pd.Categorical(list('bac')).unique()
Out[1]: 
[b, a, c]
Categories (3, object): [b, a, c]

In [2]: pd.Categorical(list('bac'), categories=list('abc')).unique()
Out[2]: 
[b, a, c]
Categories (3, object): [b, a, c]

In [3]: pd.Categorical(list('bac'), ordered=True).unique()
Out[3]: 
[b, a, c]
Categories (3, object): [a < b < c]

In [4]: pd.Categorical(list('bac'), categories=list('abc'), ordered=True).unique()
Out[4]: 
[b, a, c]
Categories (3, object): [a < b < c]

I can see how this works. Not really sure about it though. We can open an issue to discuss for 0.21.0. (this is unchanged in this PR)

.unique of Categorical/Series now returns Categorical
@jreback
Copy link
Contributor Author

jreback commented Apr 9, 2017

@jorisvandenbossche updated if you'd have a quick relook. Updated docs & added some examples.

@jorisvandenbossche jorisvandenbossche merged commit c3c60f0 into pandas-dev:master Apr 9, 2017
@jorisvandenbossche
Copy link
Member

@jreback Looks good! Did some minor changes (pd -> pandas, I don't think sphinx is aware of the pd shortcut), and merged. Thanks!

@jreback
Copy link
Contributor Author

jreback commented Apr 9, 2017

thanks!

@jreback
Copy link
Contributor Author

jreback commented Apr 9, 2017

@TomAugspurger
Copy link
Contributor

Do we need to add Categorical.unique to https://raw.githubusercontent.com/pandas-dev/pandas/master/doc/source/api.rst?

@jreback
Copy link
Contributor Author

jreback commented Apr 9, 2017

hmm yes i think that's why the links don't work
Series and pd.unique seem good

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DOC/TST: is pd.unique and the order it returns API?
5 participants