less than, greater than cuts on categorical don't follow order #9836

olgabot · 2015-04-08T14:40:25Z

When subsetting an ordered pd.Categorical object using less than/greater than on the ordered values, the comparisons follow lexicographical order, not categorical order.

If you create a dataframe and assign categories, you can subset:

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: df = pd.DataFrame({'x':np.arange(10), 'y':list('AAAABBBCCC')})

In [4]: df.y = df.y.astype('category')

In [5]: df.ix[(df.y >= "A") & (df.y <= "B")]
Out[5]: 
   x  y
0  0  A
1  1  A
2  2  A
3  3  A
4  4  B
5  5  B
6  6  B

But if you try to subset on an ordered category, it does the lexicographical order instead:

In [6]: df = pd.DataFrame({'x':np.arange(10), 'y':list('AAAABBBCCC')})

In [7]: df.y = pd.Categorical(df.y, categories=['A', 'C', 'B'], ordered=True)

In [8]: df.ix[(df.y >= "A") & (df.y <= "C")]
Out[8]: 
   x  y
0  0  A
1  1  A
2  2  A
3  3  A
4  4  B
5  5  B
6  6  B
7  7  C
8  8  C
9  9  C

jreback · 2015-04-08T14:44:43Z

ordered=True just means that the categories themselves are in order (and it's meaningful).

A subsetting operation does selection which is ordered by the index of the frame.
You need to do an operation that matters to the actual order, e.g. sort/min/max

In [18]: df.ix[(df.y >= "A") & (df.y <= "C")].sort('y')
Out[18]: 
   x  y
0  0  A
1  1  A
2  2  A
3  3  A
7  7  C
8  8  C
9  9  C
4  4  B
5  5  B
6  6  B

Note if you had a CategoricalIndex (merging very shortly), then this would work as you expect

jorisvandenbossche · 2015-04-10T13:03:16Z

@jreback I don't fully understand your explanation of the behaviour.

I think the point of having an ordered categorical is that this order is used in comparisons ?

jorisvandenbossche · 2015-04-10T13:07:36Z

Relabeled the issue until we agree on that :-)

cc @JanSchulz @shoyer

So the point is this:

In [34]: cat = Series([1, 2, 3]).astype("category", categories=[3, 2, 1], ordered=True)

In [35]: cat
Out[35]:
0    1
1    2
2    3
dtype: category
Categories (3, object): [3 < 2 < 1]

In [37]: cat[[0]]
Out[37]:
0    1
dtype: category
Categories (3, object): [3 < 2 < 1]

In [38]: cat[[0]] > 2
Out[38]:
0    False
dtype: bool

Should the categorical value 1 be larger than 2? (It is larger based on the categorical order; it is not larger based on normal numeric order.)
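
For readers following along, a minimal sketch of the two possible answers to that question. It assumes a recent pandas release in which ordered categoricals compare by category order (the behaviour argued for in this thread); the variable names are illustrative only.

```python
import pandas as pd

# Categories are deliberately listed in reverse numeric order: 3 < 2 < 1.
cat = pd.Series(pd.Categorical([1, 2, 3], categories=[3, 2, 1], ordered=True))

# Comparing against a scalar that is one of the categories uses the
# position of each value in the category list, not its numeric value.
by_category = cat > 2

# Converting to the plain integer dtype first ignores the category order.
by_value = cat.astype("int64") > 2

print(by_category.tolist())
print(by_value.tolist())
```

The first result treats 1 as the largest category; the second is the ordinary numeric comparison.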

jankatins · 2015-04-10T13:48:05Z

I agree with @jorisvandenbossche: in the last example, the comparison cat[[0]] > 2 should be True and df.y <= "C" should be False for all B in df.y.

Not sure what happened here, but the docs also show this "wrong" behavior: http://pandas.pydata.org/pandas-docs/version/0.15.2/categorical.html#comparisons
-> Both comparisons should have the same output

On the other hand, with the ordered = False default change they should raise an error :-(

kay1793 · 2015-04-10T13:48:34Z

The implementation of categories in pandas doesn't distinguish a (single) category from its underlying value:

cat = Series([1, 2, 3]).astype("category", categories=[3, 2, 1], ordered=True)
type(cat[0])
Out[7]: numpy.int64
# *not* "pandas.Category(value=1,parent=Foo)"

I expect users sometimes want to subset based on value and other times by order (just as loc/iloc selects based on label or location) but the way a category and its value are lumped into one makes it difficult to support both. If you did have singleton categories as objects, you would need to have them reference the parent set of categories they belong to in order to be meaningful.
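
A small sketch of the consequence described above, assuming a recent pandas release where ordered categoricals compare by category order. Extracting one element drops the link to the categories, while a length-1 selection keeps it.

```python
import pandas as pd

cat = pd.Series(pd.Categorical([1, 2, 3], categories=[3, 2, 1], ordered=True))

# A single element comes back as a plain scalar with no reference to the
# categories, so this is an ordinary numeric comparison (1 > 2).
scalar = cat.iloc[0]
scalar_gt = bool(scalar > 2)

# A length-1 selection keeps the categorical dtype, so the same comparison
# is evaluated against the declared order 3 < 2 < 1.
slice_gt = bool((cat.iloc[[0]] > 2).iloc[0])

print(type(scalar).__name__, scalar_gt, slice_gt)
```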

jankatins · 2015-04-10T13:50:39Z

@kay1793: ok, good catch, this should account for cat[0] > 2, but not for df.y > "C"

kay1793 · 2015-04-10T13:53:29Z

That's true.

If the cat series comparison adhered to order, users could still get a value based comparison
by "lowering" the category series into a series of values. That's bound to confuse newcomers so hopefully the error message will be informative.

jorisvandenbossche · 2015-04-10T13:56:56Z

@kay1793 Indeed, but that is the limitation of numpy we have to live with at the moment (and that is also the reason I used cat[[0]] in my example ..)

jorisvandenbossche · 2015-04-10T13:58:39Z

@JanSchulz regarding the question on "raising for ordered=False" -> at the moment we said that sorting such a category should work. So maybe it is a bit inconsistent that a direct comparison of greater/smaller than would raise?

kay1793 · 2015-04-10T14:01:58Z

That's true.

That is, unless you think what olga is asking for should work as:

cats[cats > pd.category(1,parent=<Cats>)]

or similar, rather than forcing anyone who wants cats[cats < 2] to downcast the entire array first in order to compare by value. It's a question of whether you choose the semantics based on the type of the category array or the type of other (when comparing with a category array: 2 < cats and cats > 2).

jorisvandenbossche · 2015-04-10T14:09:04Z

It would indeed be an idea to have some kind of 'categorical singleton' (just a value but with a category dtype attached), however, still, I think comparing with a 'normal' value should always try to see this value as a category.

@JanSchulz about the raising question: R does indeed raise on comparing factors (> not meaningful for factors), and it works for ordered factors as is now proposed here (comparing the order of the categories)

jankatins · 2015-04-10T14:25:03Z

@jorisvandenbossche I think sorting and comparing should be different here: I don't mind if my sorting succeeds when it shouldn't but comparing when it isn't comparable is not so nice (I think along this "example": "I can sort blue and green stones, but I can't compare them")
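
The sort-versus-compare distinction can be sketched as follows. This assumes a recent pandas release, where inequality comparisons on an unordered categorical raise TypeError; the stone colours are just the analogy from the comment above.

```python
import pandas as pd

# dtype="category" without further arguments gives an unordered categorical.
stones = pd.Series(["green", "blue", "green"], dtype="category")

# Sorting succeeds: it falls back on the order of the category list,
# which by default is the sorted order of the labels.
sorted_stones = stones.sort_values().tolist()

# An inequality comparison on an unordered categorical is rejected.
try:
    stones < "green"
    comparison_raised = False
except TypeError:
    comparison_raised = True

print(sorted_stones, comparison_raised)
```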

jankatins · 2015-04-10T14:25:50Z

I'm currently trying to find the problem. I have a few test cases which currently raise, so let's see...

jorisvandenbossche · 2015-04-10T14:33:06Z

@JanSchulz I think that is a nice analogy for the difference between sorting and comparing! I am convinced :-)

olgabot · 2015-04-10T14:41:00Z

@JanSchulz If the user has defined how they want blue and green to be sorted, then less than/greater than should make sense :)

jankatins · 2015-04-10T14:44:59Z

Jikes:

In[3]: cat = Series([1, 2, 3]).astype("category", categories=[3, 2, 1], ordered=True)
In[4]: cat > 2
Out[4]: 
0    False
1    False
2     True
dtype: bool
In[5]: cat.values > 2
Out[5]: array([ True, False, False], dtype=bool)

olgabot · 2015-04-10T14:46:33Z

??? how is that possible?

jankatins · 2015-04-10T15:00:25Z

There is a values = self.get_values() in https://github.com/pydata/pandas/blob/master/pandas/core/ops.py#L597

After that line, values is an ndarray which does lexi comparisons.
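
To illustrate the difference (this is a sketch of the idea, not the code in ops.py or in the eventual fix): comparing the materialised values is a plain string comparison, whereas comparing the integer codes respects the declared order.

```python
import numpy as np
import pandas as pd

cat = pd.Categorical(list("AAAABBBCCC"), categories=["A", "C", "B"], ordered=True)

# Materialising the values yields an ndarray of strings, so "<=" is a
# plain string comparison: every "B" is <= "C".
lexical = np.asarray(cat) <= "C"

# Comparing the integer codes against the code of "C" respects the
# declared order A < C < B: every "B" is now greater than "C".
code_of_c = cat.categories.get_loc("C")
ordered = cat.codes <= code_of_c

print(lexical.tolist())
print(ordered.tolist())
```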

jankatins · 2015-04-10T20:00:24Z

Ok, with #9848 this is now:

In[2]: import pandas as pd
Backend Qt4Agg is interactive backend. Turning interactive mode on.
In[3]: import numpy as np
In[4]:  df = pd.DataFrame({'x':np.arange(10), 'y':list('AAAABBBCCC')})
In[5]: df.y = pd.Categorical(df.y, categories=['A', 'C', 'B'], ordered=True)
In[6]: df.ix[(df.y >= "A") & (df.y <= "C")]
Out[6]: 
   x  y
0  0  A
1  1  A
2  2  A
3  3  A
7  7  C
8  8  C
9  9  C

and

In[8]: cat = pd.Series([1, 2, 3]).astype("category", categories=[3, 2, 1], ordered=True)
In[9]: cat[[0]]
Out[9]: 
0    1
dtype: category
Categories (3, int64): [3 < 2 < 1]
In[10]: cat[[0]] > 2
Out[10]: 
0    True
dtype: bool

jankatins · 2015-04-10T20:01:57Z

As far as I can understand the codepaths and the output of git blame this error was present since the beginning of "Categoricals as blocks" :-/

jreback · 2015-04-11T20:07:12Z

closed by #9848

kay1793 · 2015-04-12T15:13:50Z

Now that the behaviour is changed, what is the officially sanctioned way for doing comparisons based
on the underlying value of categories?

jorisvandenbossche · 2015-04-12T17:48:08Z

What do you mean by the "underlying value"? The behaviour as it was before?

kay1793 · 2015-04-12T18:10:21Z

not the behaviour, but what it enabled. yes, I mean the "label" value.

I'm thinking in particular of having ~~ordinal~~ interval categories in multiple frames, where it makes sense
to subset by value, but categories may vary between frames. For example heights in 5cm bins here
and 10cm bins there, and I wish to find all individuals taller than 181cm.

shoyer · 2015-04-12T18:49:58Z

Well, maybe you want to use .codes directly? What is the actual use case?


kay1793 · 2015-04-12T20:07:09Z

@shoyer, I've already updated the comment with an example, here's some code.
unfortunately .codes doesn't relate to the label's value it's only an index into the label array.

In [24]: s=pd.Series([5*i for i in np.random.randint(30,45,20)])
    ...: c = pd.Categorical(s, categories=sorted(set(s)), ordered=True)
    ...: n= pd.Series([" ".join([x,y]) 
    ...: for x in ["Francois","Jake","Cecile","Mike","Brittany"]
    ...: for y in ["Singh","Cohen","O'Malley","Durant"]
    ...: ])
    ...: df=pd.DataFrame(dict(height=c,names=n))
    ...: df
Out[24]: 
   height              names
0     160     Francois Singh
1     200     Francois Cohen
2     200  Francois O'Malley
3     205    Francois Durant
4     175         Jake Singh
5     165         Jake Cohen
6     195      Jake O'Malley
7     195        Jake Durant
8     180       Cecile Singh
9     185       Cecile Cohen
10    190    Cecile O'Malley
11    215      Cecile Durant
12    175         Mike Singh
13    210         Mike Cohen
14    215      Mike O'Malley
15    175        Mike Durant
16    205     Brittany Singh
17    155     Brittany Cohen
18    220  Brittany O'Malley
19    205    Brittany Durant

In [25]: df.height
<...>
Name: height, dtype: category
Categories (13, int64): [155 < 160 < 165 < 175 ... 205 < 210 < 215 < 220]

What is the recommended way to select all people taller than 203 cm (now that df.height > 203 doesn't work)?

kay1793 · 2015-04-12T20:10:14Z

also, just found that before #9848:

df.height > 10
Out[30]: 
0     True
1     True
2     True
...

and after:

df.height > 10
Out[7]: 
0     False
1     False
2     False
3     False

the second should probably raise with NoSuchCategory. @JanSchulz (pandas: 0.16.0-104-gd27e0a6)

jorisvandenbossche · 2015-04-12T20:45:15Z

Indeed, it is not yet fully as it should be (@kay1793 thanks for pointing that out):

In [5]: s = pd.Series([1, 5, 10]).astype('category', ordered=True)

In [6]: s
Out[6]:
0     1
1     5
2    10
dtype: category
Categories (3, int64): [1 < 5 < 10]

In [7]: s > 5
Out[7]:
0    False
1    False
2     True
dtype: bool

In [8]: s > 4
Out[8]:
0    False
1    False
2    False
dtype: bool

So I think this should either raise (with something like "4 not a category, so cannot compare") or do a correct comparison.
But as we actually cannot say what a correct comparison is (since you can give a specified order (like 1, 10, 5), where should 4 fit in?), I think the only option is to raise?
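
A sketch of the "raise" option, assuming a recent pandas release, where an inequality comparison against a scalar that is not one of the categories raises TypeError (the exact message differs between versions, so it is not checked here).

```python
import pandas as pd

s = pd.Series(pd.Categorical([1, 5, 10], categories=[1, 5, 10], ordered=True))

# 5 is a category, so the comparison is well defined.
in_categories = (s > 5).tolist()

# 4 is not a category, so there is no position to compare against.
try:
    s > 4
    raised = False
except TypeError:
    raised = True

print(in_categories, raised)
```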

jorisvandenbossche · 2015-04-12T20:50:01Z

@kay1793 BTW, if you want to compare based on the "label values" (so on the values with the dtype of the categories, not as one of the categories), you can always convert to array and compare then:

In [9]: np.array(s) > 4
Out[9]: array([False,  True,  True], dtype=bool)

jankatins · 2015-04-12T20:59:39Z

The s > 4 example should raise. If you want to have that succeed, you need to do s.astype(int) > 4 or np.asarray(s) > 4
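
Both workarounds can be sketched like this. The category order is deliberately scrambled (an assumption made for illustration) to show that a value-based comparison ignores it.

```python
import numpy as np
import pandas as pd

# Declared order is 1 < 10 < 5, which differs from the numeric order.
s = pd.Series(pd.Categorical([1, 5, 10], categories=[1, 10, 5], ordered=True))

# Convert to the dtype of the labels, then compare as ordinary integers.
via_astype = (s.astype("int64") > 4).tolist()

# np.asarray materialises the labels as a plain ndarray with the same effect.
via_numpy = (np.asarray(s) > 4).tolist()

print(via_astype, via_numpy)
```

Either form creates a full copy of the labels before comparing.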

jankatins · 2015-04-12T21:09:51Z

@kay1793 Ok, wait, I think I already have a fix, which I will push shortly...

kay1793 · 2015-04-12T21:10:59Z

I see that creating a full materialized copy of the series is what was happening anyway before #9848, which means you get 2 copies of the entire series: a scalar series followed by a bool series. The middle step is unnecessary, especially on an ordered series.

@JanSchulz, I don't understand what's going on with equality vs. inequality when comparing with a non-category value #9848 (comment).

jankatins · 2015-04-12T21:11:39Z

@kay1793 BTW: the above example is misleading as height is a metric variable and therefore should not be converted to Categorical :-)

kay1793 · 2015-04-12T21:11:58Z

@JanSchulz , they're quantized into bins so they are both...

shoyer · 2015-04-12T21:13:57Z

if height is binned, it should probably be saved as intervals... which will need a real interval type to get right :)

kay1793 · 2015-04-12T21:34:12Z

call it the number of games participated in for every NBA player in history then - discrete but unbinned. details...

jorisvandenbossche · 2015-04-12T21:35:42Z

@kay1793 But it is not because something is discrete that you should put it in a categorical?

shoyer · 2015-04-12T21:35:49Z

call it the number of games participated in for every NBA player in history then - discrete but unbinned. details...

Should be some sort of integer dtype, I think :).

kay1793 · 2015-04-12T21:45:29Z

if it's discrete and taken from a finite small set I think categories can often make sense. if that ties in nicely to plotting and similar conveniences for example.

jankatins · 2015-04-12T21:50:16Z

Ok, please have a look at #9864. Let's see what the tests say...

@kay1793 statistical tests will handle it wrongly, e.g. OLS will probably add a dummy variable for all (n-1) categories for height / number of games, and you probably only want it treated as a metric variable.

kay1793 · 2015-04-12T22:00:53Z

... test-subjects by country (ordered by population size) and I'm interested in all people from countries whose name comes before "Zambia", because the hypothesis is that living in a country whose name begins with z correlates with lower life expectancy (possibly true).

shoyer · 2015-04-15T18:13:22Z

... test-subjects by country (ordered by population size) and I'm interested in all people from countries whose name comes before "Zambia", because the hypothesis is that living in a country whose name begins with z correlates with lower life expectancy (possibly true).

This seems very hypothetical to me. But you can still do this sort of thing by using categories explicitly, e.g., df[df.country.isin([c for c in df.country.cat.categories if c < 'z'])]

kay1793 · 2015-04-15T21:28:37Z

df[df.country.isin([c for c in df.country.cat.categories if c < 'z'])]

An issue with this is that it encourages writing non-vectorized code. This should be avoided, as you mentioned in another issue. Also, this seems very verbose to me.

jorisvandenbossche · 2015-04-15T21:40:54Z

you can also do np.array(df.country) < 'Z' if you want to do a comparison on the raw values.

That will be a trade-off you have to consider, using a categorical type or not. It provides some nice features, but also has the consequence that it is no longer regarded as just a value.

shoyer · 2015-04-15T21:52:25Z

Well, there shouldn't be too many categories in most cases, but it would also work as a vectorized operation, just a little more verbose:

cats = df.country.cat.categories
df[df.country.isin(cats[cats < 'z'])]

kay1793 · 2015-04-15T21:58:34Z

Yeah, that's probably something for pandas-ply to use as another example for "pandas code is ugly" :).

In our opinion, this pandas-ply code is cleaner, more expressive, more readable, 
more concise, and less error-prone than the original pandas code.

joris, the thing is that this compromise is purely accidental. If you had some way (by use of types, probably) of asking for that behavior, __lt__ knows that it is attached to a categorical type and could lookup the label value associated with the catcode as it goes along. The user code would be more concise and there'd be no need to materialize a full copy of the de-factorized array (which is mem-costly, for string labels).

But both solutions work, that's true enough.
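
A rough sketch of what "looking up the label value as it goes along" could mean in practice. compare_by_label is a hypothetical helper written for this illustration, not a pandas API: it compares each distinct label once and then broadcasts the result through the integer codes, so the full array of labels is never materialised.

```python
import operator

import numpy as np
import pandas as pd


def compare_by_label(series, op, other):
    """Hypothetical helper: compare the labels once, then index by code."""
    categories = series.cat.categories
    codes = series.cat.codes.to_numpy()
    # One comparison per distinct label instead of one per row.
    per_category = np.asarray(op(categories, other), dtype=bool)
    result = np.zeros(len(series), dtype=bool)
    valid = codes >= 0  # code -1 marks a missing value; leave it False
    result[valid] = per_category[codes[valid]]
    return pd.Series(result, index=series.index)


countries = pd.Series(pd.Categorical(
    ["Zambia", "Chad", "Peru", "Chad"],
    categories=["Zambia", "Peru", "Chad"],  # made-up order for illustration
    ordered=True,
))
mask = compare_by_label(countries, operator.lt, "Z")
print(mask.tolist())
```

The design choice here is to keep the categorical order out of the comparison entirely; only the labels and the codes are used.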

kay1793 · 2015-04-15T22:01:55Z

once people begin to create packages solely to avoid what is considered "idiomatic" code... might be a good idea to take notice.

jorisvandenbossche · 2015-04-15T22:12:37Z

Why is this compromise accidental?
Say I have a categorical that has categories ["low", "middle", "high"], I want < "high" to work on the categories order, which just conflicts with the type of comparison you want to make. So you have to make a choice here I think?

Even if there would be some kind of categorical singleton, so you could explicitly compare to a category (pseudo code of < category('high')), and so __lt__ could know it is comparing against a category and so uses the categorical order, I would argue that the plain string comparison (< 'high') should also do this and not fall back to plain non-categorical comparison, as this would only be the cause of confusion and bugs.

kay1793 · 2015-04-15T22:38:16Z

There's no reason not to support both concisely but only one of them is - in that sense it is accidental. The confusion you mention seems very hypothetical to me. That would only happen if you tried to guess what the user wants. Being explicit about it should work ok: see .ix vs .loc/.iloc.

Having a categorical singleton wouldn't help at this point because you just made [cat < 6] (instead of [cat < Cat(6)]) mean compare by cat-order forever. You simply had one syntactic "slot" to fill and chose to throw away one and keep the other, but both are useful.

Why isn't there for example an .astype('object|int|S16') attr on category series? it wouldn't be efficient but at least it'd be easy.

df[df.country.isin([c for c in df.country.cat.categories if c < 'z'])]
vs.
df[df.country.astype('object') <'z']

a little shorter and much clearer.
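
A self-contained sketch comparing the two spellings on made-up data (the countries, their order and the subject column are assumptions for illustration). It uses "Z" rather than 'z' as the cut-off so that the filter actually excludes a row, since every capitalised name sorts before lowercase 'z'.

```python
import pandas as pd

df = pd.DataFrame({
    "country": pd.Categorical(
        ["Zambia", "Chad", "Peru", "Chad"],
        categories=["Chad", "Zambia", "Peru"],  # hypothetical population order
        ordered=True,
    ),
    "subject": [1, 2, 3, 4],
})

# Filter the (short) list of categories by label, then select matching rows.
cats = df.country.cat.categories
via_isin = df[df.country.isin(cats[cats < "Z"])]

# Convert the column to plain objects and compare the labels directly.
via_astype = df[df.country.astype("object") < "Z"]

print(via_isin.subject.tolist(), via_astype.subject.tolist())
```

Both select the same rows; the astype form copies every label, while the isin form only touches the distinct categories.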

jorisvandenbossche · 2015-04-15T22:47:53Z

There is:

In [6]: s = pd.Series([1, 2, 3, 1, 2, 1], dtype='category')

In [7]: s
Out[7]:
0    1
1    2
2    3
3    1
4    2
5    1
dtype: category
Categories (3, int64): [1, 2, 3]

In [8]: s.astype('int64')
Out[8]:
0    1
1    2
2    3
3    1
4    2
5    1
dtype: int64

The same works with astype(str) for string categories.
But this is more or less the same as the np.array(s) I suggested (df[np.asarray(df.country) <'z']). It is indeed maybe a bit more idiomatic, though. Maybe we should mention that in the docs as well.

There's no reason not to support both concisely

Do you have a suggestion of an interface?

kay1793 · 2015-04-15T23:01:21Z

That's strange.

In [6]: df = pd.DataFrame({'x':np.arange(10), 'y':list('AAAABBBCCC')})
In [7]: foo = pd.Categorical(df.y, categories=['A', 'C', 'B'], ordered=True)
In [8]: df.y = foo

now foo doesn't have astype but df.y does. I understand why but that's a quirk.

Anyway, you're right for series with cat dtype astype is there and that's idiomatic and easy enough.

Do you have a suggestion of an interface?

You could have a proxy (available as an attr) that implemented the magic methods by looking up the value associated with the code invisibly. To be honest I'd just use astype because it would be a lot of work and I'm not convinced it matters enough in terms of performance to do it.

just call astype the recommended way.

jreback added Usage Question Categorical Categorical Data Type labels Apr 8, 2015

jorisvandenbossche added API Design and removed Usage Question labels Apr 10, 2015

jorisvandenbossche added this to the 0.16.1 milestone Apr 10, 2015

jankatins mentioned this issue Apr 10, 2015

Fix for unequal comparisons of categorical and scalar #9848

Closed

olgabot mentioned this issue Apr 10, 2015

For ordered sample subsets, be able to take any flexible range YeoLab/flotilla#287

Open

jreback closed this as completed Apr 11, 2015

jankatins mentioned this issue Apr 12, 2015

Fix for comparisons of categorical and an scalar not in categories #9864

Closed
