Skip to content

DOC/TST: is pd.unique and the order it returns API? #9346

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
shoyer opened this issue Jan 23, 2015 · 22 comments · Fixed by #15939
Closed

DOC/TST: is pd.unique and the order it returns API? #9346

shoyer opened this issue Jan 23, 2015 · 22 comments · Fixed by #15939
Assignees

Comments

@shoyer
Copy link
Member

shoyer commented Jan 23, 2015

Looking at the API docs, neither the function pd.unique nor the order of the unique values from unique is mentioned. I would like to:

  1. Add pd.unique to API > General functions > Data manipulations
  2. Note that Series.unique() and unique() return values in order of appearance
  3. Add unit tests to verify the "order of appearance" nature unique (untested directly AFAICT, though I'm sure it's relied on implicitly)

Any objections? If not, I'll put together a PR.

This lack of documentation has caused some hesitation for relying on this functionality in Seaborn: mwaskom/seaborn#361

@jankatins
Copy link
Contributor

I would like to suggest that Series.unique() is changed to the same "API" as np.unique(), e.g. that the result is sorted.

Numpy actually specifies that the returned values are sorted: http://docs.scipy.org/doc/numpy/reference/generated/numpy.unique.html

Returns the sorted unique elements of an array. [...]

@jreback
Copy link
Contributor

jreback commented Jan 23, 2015

iirc pd.unique has a much greater perf and keeps things in the same order as compared to np.unique mainly by using a hashable impl

however I don't know the actual guarantees

@jankatins
Copy link
Contributor

Just as a referece:

In[11]: s = pd.Series([1, 3, 2, np.nan, 1,4])
In[12]: s.unique()
Out[12]: array([  1.,   3.,   2.,  nan,   4.])
In[13]: np.unique(s)
Out[13]: array([  1.,   2.,   3.,   4.,  nan])
In[14]: s = pd.Series(["a", "c", "b", np.nan, "a", "d"])
In[15]: s.unique()
Out[15]: array(['a', 'c', 'b', nan, 'd'], dtype=object)
In[16]: np.unique(s)
Out[16]: 
array(['a', 'b', 'c', 'd', 'nan'], 
      dtype='|S3')

@shoyer
Copy link
Member Author

shoyer commented Jan 23, 2015

The underlying implementation is defined in hashtable.pyx, e.g.,: https://github.com/pydata/pandas/blob/3030bba1f151160f7e4cdfba8d3a00276b07be17/pandas/hashtable.pyx#L785

So it pretty clearly does not sort, just loops through the values (using a hashtable) and appends when not found.

I like the non-sorting behavior of pd.unique for three reasons:

  1. It's faster (sometimes even np.sort(pd.unique(values)) is faster than np.unique(values))
  2. It's more composable
  3. np.unique does not cover this use-case (unsorted unique values)

@TomAugspurger
Copy link
Contributor

Looking briefly at pd.unique, it dispatches on the dtype (float, int, other). We'll need to make sure that all of these guarantee the same things.

@shoyer is right about the looping (was just going to post that), and I agree that non-sorting is desirable since it's more composable (and faster).

@jankatins
Copy link
Contributor

Can someone give me an example of "more composable"? I buy the faster, but I haven't seen a case where a "sorted by appearance" sorting of unique is better than a sorted unique.

@shoyer
Copy link
Member Author

shoyer commented Jan 23, 2015

@JanSchulz np.unique and np.sort each do one thing (uniques or sorting). np.unique does two things (both finding uniques and sorting). If you don't want both, you are out of luck. This is just a specific example of a general principle for API/language design, e.g., see http://stackoverflow.com/a/2887190/809705

@jankatins
Copy link
Contributor

The problem I see here is that the current order is kind of random as it depends on the implementation of hashlib which drives unique. If that would change, unique would also change or, if your proposal would be chosen, would need a special sorting step which would be potentially costlier than simple sorting.

I also don't see much value in having a "in order of appearance" order of unique. My datasets usually are sorted by some (arbitrary) id and so "order of appearance" makes no sense and I would usually have to sort that anyway. And I don't get why I should sort my complete dataframe just to get a sorted unique and thereby a sorted plot like it is implied in the original seaborn bug report.

@shoyer
Copy link
Member Author

shoyer commented Jan 25, 2015

Just to be entirely clear, the order of unique here is entirely determined by the HashTable.unique method in pandas. There's no dependence on klib or external dependencies, and each of the three implementations (int, float, object) use almost identical logic, appending unique values in the order in which they are found.

Right now, the order from unique this is technically an implementation detail, but it's one that is assuredly relied on implicitly in user code. The only way it would break it is if we changed the underlying implementation of unique, which seems unlikely. I think we should make it API so users can safely rely on it and predict exactly what it will do.

@JanSchulz not quite sure what you're getting at there:

And I don't get why I should sort my complete dataframe just to get a sorted unique and thereby a sorted plot like it is implied in the original seaborn bug report.

You don't need to sort your complete frame -- you can use df.drop_duplicates().order(). For a series, you are right that s.unique().order() will not work... I'm not quite sure what we should do about that. But that is already existing behavior. Perhaps this would weigh in favor of making Categorical.unique return another categorical as discussed in #9331?

@jankatins
Copy link
Contributor

Ok, just saw it using the klib hastable and didn't look up the actual implementation. The implemenation seems to work also in other hastyble implementation (as long as a "this value is already in the table" indicator exists), but what if a new version of hastable brings with it a klib_get_unique_values() by itself which is faster but sorts differently?

Re the last thing: What I don't buy is that to get the unique values sorted in a special way (for plotting,...), you want/need to manipulate the dataframe. IMO that's a waste of CPU cycles, whether it is drop_duplicates() or sort or whatever... Much easier to pass in a "sorted as you want" array or sort the returned array... At least as long as len(dataframe) >> len(unique) which is probably true in the original seaborn bug.

Making categorical.unique() return a Categorical will not change that as the values are still in some particular order (or not).

@shoyer
Copy link
Member Author

shoyer commented Jan 25, 2015

@JanSchulz If you want sorted unique values, you can use np.sort(s.unique()) or even np.unique(s). Is that so bad? It's hardly wasteful of CPU cycles.

This works for every type other than categorical. If you want to add a pd.sort method which works like np.sort but also handles categoricals correctly, that I'm all for it.

what if a new version of hastable brings with it a klib_get_unique_values() by itself which is faster but sorts differently?

This may very well be possible but it seems pretty far-fetched to me.

@jankatins
Copy link
Contributor

I'm not so much arguing for sorting but for making it clear that the order is undefined and can change.

I do think that seaborn should sort, though.

@jreback
Copy link
Contributor

jreback commented Jan 25, 2015

Here are my 2c.

  • let's document that the returned results of pd.unique are in the order of appearance (implicity the current API).
  • sorting is up to the user; its a mistake to make this sort by default, then you need all kinds of options to control the sort, and its much slower

so @shoyer top-of-pr is fine by me. Also ok with making the ordering 'undefined'.

@jorisvandenbossche
Copy link
Member

Looking at the above discussion, I think there are three possibilities for unique. We can say:

  1. that the order is undefined and cannot be relied upon (as it relies upon implementation details)
  2. that the output is sorted (as numpy does)
  3. that the output is in order of appearance (and we guarantee this)

I personally don't think that case 1) is a good idea, as people will always start relying on the output order in their code, even if we document that they shouldn't. So I would personally strive to have a well-defined order (and don't let users rely upon an implementation detail).
I am not familiar with the numpy implementation, but this is possibly the reason they sort the output? To have a well defined output order, without having to guarantee the output order of the unique algorithm itself?

So, if we think we can guarantee the output order with the Hashtable implementation, why not documenting this? So people can safely rely on this, as they already will do.

@jankatins
Copy link
Contributor

Just for the reference: numpy sorts since 2006. Not sure how to get to the earlier unique implementation: https://github.com/numpy/numpy/blob/0ab9ae36e8484d06f2a8e70c6fe45004e73408b4/numpy/lib/arraysetops.py

@kay1793
Copy link

kay1793 commented Feb 9, 2015

jumping in here...

  1. the name unique documents itself and it says nothing about order.
  2. because of that you could drop in a faster unique in the future without breaking any guarantees this way.
  3. of course users rely on unique right now, what else can they do? pandas has no explicit API for ordered unique separately from unique. If users had a way to ask for that explicitly they would. removing unique from the API and using that name instead for unique+ordered or unique+sorted is the wrong way to fix that issue.

API guarantees are not the same as documenting implementation details. confusing the two makes for bad maintainability and bad API design.

Just for the reference: numpy sorts since 2006.

... and they can't make user code run faster by simply dropping in the faster algorithm as a result, because they made unique and sorted unique the same thing. exactly.

EDIT: maybe it would be good to document that unique makes no guarantees on order, to prevent future confusion. leaving non-guarantees undocumented seems to trigger attempts to "fix it". @jorisvandenbossche

@jreback
Copy link
Contributor

jreback commented Feb 25, 2015

suggestion is to add a kind keyword, so currently should guarantee the order

@jorisvandenbossche jorisvandenbossche added this to the 0.16.0 milestone Feb 25, 2015
@TomAugspurger
Copy link
Contributor

TomAugspurger commented Feb 27, 2015 via email

@kay1793
Copy link

kay1793 commented Feb 28, 2015

you'll just bomb lots of users with useless false warnings.

pandas should leave unique as it is, and ask (only) users relying on undocumented behavior to fix their code now and use the new method/keyword. after that if a better algo comes along you put it in without breaking any code and make things faster for all users of unique without them doing anything.

@jreback
Copy link
Contributor

jreback commented Mar 5, 2015

@shoyer this is just docs at this point (maybe some tests as well), right?

@shoyer
Copy link
Member Author

shoyer commented Mar 6, 2015

@jreback Mostly yes, though I might also do some tweaks to pd.unique to ensure it works on all dtypes (e.g., on categorical)

@jreback
Copy link
Contributor

jreback commented Mar 6, 2015

ok, i'll move to 0.16.1 (don't want to hold us up on this issue). as its not going to be an api change, just docs,tests and such.

@jreback jreback modified the milestones: 0.16.1, 0.16.0 Mar 6, 2015
@jreback jreback modified the milestones: 0.17.0, 0.16.1 Apr 28, 2015
@shoyer shoyer self-assigned this Jun 29, 2015
@jreback jreback modified the milestones: Next Major Release, 0.17.0 Aug 20, 2015
jreback added a commit to jreback/pandas that referenced this issue Apr 7, 2017
jreback added a commit to jreback/pandas that referenced this issue Apr 7, 2017
jreback added a commit to jreback/pandas that referenced this issue Apr 7, 2017
jreback added a commit to jreback/pandas that referenced this issue Apr 7, 2017
jreback added a commit to jreback/pandas that referenced this issue Apr 7, 2017
jreback added a commit to jreback/pandas that referenced this issue Apr 8, 2017
jreback added a commit to jreback/pandas that referenced this issue Apr 9, 2017
jorisvandenbossche pushed a commit that referenced this issue Apr 9, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants