less than, greater than cuts on categorical don't follow order #9836
A subsetting operation does selection which is ordered by the index of the frame.
Note if you had a |
@jreback I don't fully understand your explanation of the behaviour. I think the point of having an ordered categorical is that this order is used in comparisons ? |
Relabeled the issue until we agree on that :-) So the point is this:
Should the categorical value |
I agree with @jorisvandenbossche: in the last example, the comparison Not sure what happened here, but the docs also show this "wrong" behavior: http://pandas.pydata.org/pandas-docs/version/0.15.2/categorical.html#comparisons On the other hand, with the |
The implementation of categories in pandas doesn't distinguish a (single) category from its underlying value:
I expect users sometimes want to subset based on value and other times by order (just as loc/iloc selects based on label or location) but the way a category and its value are lumped into one makes it difficult to support both. If you did have singleton categories as objects, you would need to have them reference the parent set of categories they belong to in order to be meaningful. |
@kay1793: ok, good catch, this should account for |
That's true. If the cat series comparison adhered to order, users could still get a value based comparison |
@kay1793 Indeed, but that is the limitation of numpy we have to live with at the moment (and that is also the reason I used |
@JanSchulz regarding the question on "raising for ordered=False" -> at the moment we said that sorting such a category should work. So maybe it is a bit inconsistent that a direct comparison of greater/smaller than would raise? |
That is, unless you think what olga is asking for should work as:
or similar, rather than forcing anyone who wants |
It would indeed be an idea to have some kind of 'categorical singleton' (just a value but with a category dtype attached), however, still, I think comparing with a 'normal' value should always try to see this value as a category. @JanSchulz about the raising question: R does indeed raise on comparing factors ( |
@jorisvandenbossche I think sorting and comparing should be different here: I don't mind if my sorting succeeds when it shouldn't but comparing when it isn't comparable is not so nice (I think along this "example": "I can sort blue and green stones, but I can't compare them") |
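The distinction between sorting and comparing can be sketched as follows. This is only an illustration, assuming a pandas version that includes the change from #9848; the `stones` data is made up for the example.

```python
import pandas as pd

# Hypothetical data: an unordered categorical of stone colours.
stones = pd.Series(
    pd.Categorical(
        ["green", "blue", "green"],
        categories=["blue", "green"],
        ordered=False,
    )
)

# Sorting works even without a declared order (it uses the category positions).
sorted_stones = stones.sort_values()

# Equality comparisons are always allowed.
equal_mask = stones == "green"

# Less than / greater than on an unordered categorical raises TypeError.
try:
    stones < "green"
    comparison_raised = False
except TypeError:
    comparison_raised = True

print(sorted_stones.tolist(), equal_mask.tolist(), comparison_raised)
```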
I'm currently trying to find the problem. I've a few testcases, which currently raise, so lets see... |
@JanSchulz I think that is a nice analogy for the difference between sorting and comparing! I am convinced :-) |
@JanSchulz If the user has defined how they want blue and green to be sorted, then less than/greater than should make sense :) |
Jikes:
|
??? how is that possible? |
There is a After that line, |
Ok, with #9848 this is now:
and
|
As far as I can understand the codepaths and the output of |
closed by #9848 |
Now that the behaviour is changed, what is the officially sanctioned way for doing comparisons based |
What do you mean with the "underlying value"? The behaviour it was before? |
not the behaviour, but what it enabled. yes, I mean the "label" value. I'm thinking in particular of having |
Well, maybe you want to use .codes directly? What is the actual use case?
|
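A rough sketch of what comparing on `.codes` could look like. The `grades` series and the threshold category are assumptions made up for illustration; note that missing values carry the code -1, so they need separate handling.

```python
import pandas as pd

# Hypothetical ordered categorical: low < mid < high.
grades = pd.Series(
    pd.Categorical(
        ["low", "high", "mid", "low"],
        categories=["low", "mid", "high"],
        ordered=True,
    )
)

# Position of the threshold category within the ordered categories.
threshold_code = grades.cat.categories.get_loc("mid")

# Integer comparison on the codes, equivalent to "greater than 'mid'" in category order.
above_mid = grades.cat.codes > threshold_code

print(above_mid.tolist())
```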
@shoyer, I've already updated the comment with an example, here's some code.
What is the recommended way to select all people taller than 203 cm (now that df.height > 203 doesn't work)? |
also, just found that before #9848: df.height > 10
Out[30]:
0 True
1 True
2 True
... and after:
the second should probably raise with |
Indeed, it is not yet fully as it should be (@kay1793 thanks for pointing that out):
So I think this should either raise (with something like "4 not a category, so cannot compare"), or either do a correct comparison. |
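For reference, a small sketch of how a comparison against a non-category value behaves in more recent pandas versions (the thread itself was still deciding this). The `levels` data is hypothetical.

```python
import pandas as pd

# Hypothetical ordered categorical with categories 3 < 2 < 1.
levels = pd.Series(pd.Categorical([1, 2, 3], categories=[3, 2, 1], ordered=True))

# Equality against a value that is not a category simply yields all False.
eq_mask = levels == 4

# An ordering comparison against a non-category value raises TypeError.
try:
    levels > 4
    raised = False
except TypeError:
    raised = True

print(eq_mask.tolist(), raised)
```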
@kay1793 BTW, if you want to compare based on the "label values" (so on the values with the dtype of the categories, not as one of the categories), you can always convert to array and compare then:
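A minimal sketch of that conversion, assuming a made-up `heights` series whose categories are integers. `np.asarray` materializes the label values, so the comparison also works for a threshold that is not itself a category.

```python
import numpy as np
import pandas as pd

# Hypothetical heights in cm stored as an ordered categorical.
heights = pd.Series(
    pd.Categorical(
        [198, 203, 211, 206],
        categories=[198, 203, 206, 211],
        ordered=True,
    )
)

# Convert to a plain array of label values, then compare as ordinary integers.
taller = np.asarray(heights) > 204

# The boolean array can be used to subset the original series.
selected = heights[taller]

print(taller.tolist(), selected.tolist())
```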
|
The |
@kay1793 Ok, wait, I think I already have a fix, which I will push shortly... |
I see that creating a full materialized copy of the series is what was happening anyway before #9848, which means you get 2 copies of the entire series, a scalar series followed by a bool series, the middle step is unnecessary, especially on an ordered series. @JanSchulz, I don't understand what's going on with equality vs. inequality when comparing with a non-category value #9848 (comment). |
@kay1793 BTW: the above example is misleading as height is a metric variable and therefore should not be converted to Categorical :-) |
@JanSchulz , they're quantized into bins so they are both... |
if height is binned, it should probably be saved as intervals... which will need a real interval type to get right :) |
call it the number of games participated in for every NBA player in history then - discrete but unbinned. details... |
@kay1793 But just because something is discrete doesn't mean you should put it in a categorical? |
Should be some sort of integer dtype, I think :). |
if it's discrete and taken from a finite small set I think categories can often make sense. if that ties in nicely to plotting and similar conveniences for example. |
... test-subjects by country (ordered by population size) and I'm interested in all people from countries whose name comes before "Zambia", because the hypothesis is that living in a country whose name begins with z correlates with lower life expectancy (possibly true). |
This seems very hypothetical to me. But you can still do this sort of thing by using categories explicitly, e.g., |
An issue with this is that it encourages writing non-vectorized code. this should be avoided as you mentioned in another issue. Also, This seems very verbose to me. |
you can also do That will be a trade-off you have to consider, using a categorical type or not. It provides some nice features, but also has the consequence that it is no longer regarded as just a value. |
Well, there shouldn't be too many categories in most cases, but it would also work as a vectorized operation, just a little more verbose:
cats = df.country.cat.categories
df[df.country.isin(cats[cats < 'z'])] |
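A self-contained version of the snippet above, with a made-up `df` so it can be run as is. The country list and its ordering are assumptions for illustration; the threshold is an uppercase "Z" because the example names are capitalized.

```python
import pandas as pd

# Hypothetical frame: countries as an ordered categorical (order is arbitrary here).
df = pd.DataFrame(
    {
        "country": pd.Categorical(
            ["Zambia", "Chad", "India", "Zimbabwe"],
            categories=["Chad", "Zambia", "Zimbabwe", "India"],
            ordered=True,
        )
    }
)

# Compare the category labels lexicographically, independent of category order.
cats = df.country.cat.categories
before_z = cats[cats < "Z"]

# Select the rows whose country is one of the matching labels.
subset = df[df.country.isin(before_z)]

print(subset)
```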
Yeah, that's probably something for pandas-ply to use as another example for "pandas code is ugly" :).
joris, the thing is that this compromise is purely accidental. If you had some way (by use of types, probably) of asking for that behavior, But both solutions work, that's true enough. |
once people begin to create packages solely to avoid what is considered "idiomatic" code... might be a good idea to take notice. |
Why is this compromise accidental? Even if there were some kind of categorical singleton, so you could explicitly compare to a category (pseudo code of |
There's no reason not to support both concisely but only one of them is - in that sense it is accidental. The confusion you mention seems very hypothetical to me. That would only happen if you tried to guess what the user wants. Being explicit about it should work ok: see Having a categorical singleton wouldn't help at this point because you just made Why isn't there for example an
a little shorter and much clearer. |
There is:
same works with
Do you have a suggestion of an interface? |
That's strange.
now Anyway, you're right for series with cat dtype
You could have a proxy (available as an attr) that implemented the magic methods by looking up the value associated with the code invisibly. To be honest I'd just call |
When subsetting an ordered pd.Categorical object using less than/greater than on the ordered values, the less than/greater than comparisons follow lexicographical order, not categorical order.
If you create a dataframe and assign categories, you can subset:
But if you try to subset on an ordered category, it does the lexicographical order instead:
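The original example is not preserved here, so the following is only a hypothetical sketch of the kind of comparison being reported. The data and column names are made up. On pandas versions that include the fix from #9848, the mask follows the declared category order; a lexicographic comparison of the same labels would select nothing.

```python
import pandas as pd

# Hypothetical data: the logical order differs from alphabetical order.
shirt_sizes = pd.Categorical(
    ["small", "large", "medium", "small"],
    categories=["small", "medium", "large"],
    ordered=True,
)
df = pd.DataFrame({"shirt_size": shirt_sizes, "value": [1, 2, 3, 4]})

# Category order: small < medium < large.
# Lexicographic order would instead give large < medium < small.
mask = df["shirt_size"] > "small"
subset = df[mask]

print(subset)
```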