-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
DOC/TST: add pd.unique doc-string & buggy return of Categorical #15939
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
I also added the already implicit guarantees of |
+1 on finally just adding the order of appearance to the docstring Regarding the categorical unique, we have had quite some discussions about that before. Do you recall the outcome of that? (whether it should return an array or not) |
This is from 0.19.2. So we already do this. The bug is that
|
OK, then I misremember that (or maybe I remember the discussion about the datetime tz case ..). Something else, the datetime with tz example are not correct I think, I get also an array for them. |
Ignore this one. Slightly wrong. 0.19.2
master
I think these were completely non-tested before. I think that |
Codecov Report
@@ Coverage Diff @@
## master #15939 +/- ##
==========================================
+ Coverage 90.97% 90.97% +<.01%
==========================================
Files 145 145
Lines 49475 49480 +5
==========================================
+ Hits 45008 45013 +5
Misses 4467 4467
Continue to review full report at Codecov.
|
Codecov Report
@@ Coverage Diff @@
## master #15939 +/- ##
==========================================
+ Coverage 91% 91% +<.01%
==========================================
Files 145 145
Lines 49576 49581 +5
==========================================
+ Hits 45118 45123 +5
Misses 4458 4458
Continue to review full report at Codecov.
|
So I fixed this to be consistent. Though Its now consistent between calling |
774dbbc
to
f621ed3
Compare
@chris-b1 @jorisvandenbossche @shoyer FYI, not trying to break new ground here at all, just make things consistent. |
pandas/core/algorithms.py
Outdated
unique values. | ||
- If the input is a Categorical dtype, the return is a Categorical | ||
- If the input is an Index, the return is an Index | ||
- If the input is a Series/ndarray, the return will be an ndarray |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is good for now. In the future (pandas 2), the return type should probably be the same as the input (i.e., Series -> Series).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agreed. I'll have to start making a list of these types of changes somewhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it should rather return an Array of the correct type instead of a Series (with a meaningless index in case of uniques), but that's a discussion for the other issue tracker :-) (and this consistently for all input object types / data types)
Looks good, only question would on timezones, it looks you are taking the approach proposed in #15750 - returning an object array? I suppose that's the right answer. |
chris that's a good point it is somewhat orthogonal to this one though (.unique already returns an object array of the stamps) |
@jorisvandenbossche if you'd have a look. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mainly some minor doc comments. One bigger comment regarding the return type for a Categorical (Categorical vs categorical Series)
pandas/core/algorithms.py
Outdated
""" | ||
Hash table-based unique | ||
Hash table-based unique. uniques are returned in order |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uniques -> Uniques
doc/source/whatsnew/v0.20.0.txt
Outdated
.. code-block:: ipython | ||
|
||
# Series | ||
In [5]: Series(pd.Index([pd.Timestamp('20160101', tz='US/Eastern'), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You don't need to 'Index' wrapped around the values (makes the line a little bit shorter)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also Series -> pd.Series
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
doc/source/whatsnew/v0.20.0.txt
Outdated
[a, b, c] | ||
Categories (3, object): [a, b, c] | ||
|
||
In [2]: pd.unique(pd.Series(pd.Categorical(list('aabc'))).unique()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
twice 'unique'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
doc/source/whatsnew/v0.20.0.txt
Outdated
|
||
# returns a Categorical | ||
pd.Series(pd.Categorical(list('aabc'))).unique() | ||
pd.unique(pd.Series(pd.Categorical(list('aabc'))).unique()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This one (without the .unique()
, so only with pd.unique(..)
) still returns something different -> a categorical Series instead of Categorical (like the one above)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
these both return Categoricals
(see comment above)
pandas/core/algorithms.py
Outdated
Returns | ||
------- | ||
unique values. | ||
- If the input is a Categorical dtype, the return is a Categorical |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those bullet points need one level of indentation (they are the 'description' part like in the parameters section)
pandas/core/algorithms.py
Outdated
|
||
if isinstance(values, ABCSeries): | ||
from pandas import Series | ||
return Series(values.values.unique(), name=values.name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As commented above, I think we shouldn't do the Series wrapping here
That would also be consistent with .unique()
. With this branch:
In [13]: pd.unique(pd.Series(pd.Categorical(list('aabc'))))
Out[13]:
0 a
1 b
2 c
dtype: category
Categories (3, object): [a, b, c]
In [14]: pd.Series(pd.Categorical(list('aabc'))).unique()
Out[14]:
[a, b, c]
Categories (3, object): [a, b, c]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes fixed, so cat Series or Cat -> Cat, still CI -> CI.
One more thing that surprised me is that the unique of a Categorical changes the actual 'categories':
So the actual values have the order of appearance (like |
@jorisvandenbossche from your last
This is documented in This then preserves the ordering of appearance (rather than the category ordering, I think to preserve the e.g. same in 0.19.2 and now.
I can see how this works. Not really sure about it though. We can open an issue to discuss for 0.21.0. (this is unchanged in this PR) |
.unique of Categorical/Series now returns Categorical
@jorisvandenbossche updated if you'd have a quick relook. Updated docs & added some examples. |
@jreback Looks good! Did some minor changes (pd -> pandas, I don't think sphinx is aware of the pd shortcut), and merged. Thanks! |
thanks! |
hmm links are not rendering in see also? |
Do we need to add |
hmm yes i think that's why the links don't work |
closes #9346