less than, greater than cuts on categorical don't follow order #9836

Closed
olgabot opened this issue Apr 8, 2015 · 51 comments
Labels
API Design Categorical Categorical Data Type
Milestone

Comments

@olgabot

olgabot commented Apr 8, 2015

When subsetting an ordered pd.Categorical object using less than/greater than on the ordered values, the comparisons follow lexicographical order, not categorical order.

If you create a dataframe and assign categories, you can subset:

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: df = pd.DataFrame({'x':np.arange(10), 'y':list('AAAABBBCCC')})

In [4]: df.y = df.y.astype('category')

In [5]: df.ix[(df.y >= "A") & (df.y <= "B")]
Out[5]: 
   x  y
0  0  A
1  1  A
2  2  A
3  3  A
4  4  B
5  5  B
6  6  B

But if you try to subset on an ordered category, it uses lexicographical order instead:

In [6]: df = pd.DataFrame({'x':np.arange(10), 'y':list('AAAABBBCCC')})

In [7]: df.y = pd.Categorical(df.y, categories=['A', 'C', 'B'], ordered=True)

In [8]: df.ix[(df.y >= "A") & (df.y <= "C")]
Out[8]: 
   x  y
0  0  A
1  1  A
2  2  A
3  3  A
4  4  B
5  5  B
6  6  B
7  7  C
8  8  C
9  9  C
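
For readers on a recent pandas release, the behaviour requested here can be sketched as below. This is only an illustration, assuming `DataFrame.loc` in place of the since-removed `.ix`; the variable names `mask` and `subset` are not from the original report.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(10), "y": list("AAAABBBCCC")})
# Declare the category order explicitly: A < C < B
df["y"] = pd.Categorical(df["y"], categories=["A", "C", "B"], ordered=True)

# With order-aware comparisons, "B" sorts after "C", so the B rows drop out
mask = (df["y"] >= "A") & (df["y"] <= "C")
subset = df.loc[mask]
print(subset)
```

Under the categorical order A < C < B, only the A and C rows are expected to survive the filter.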
@jreback
Contributor

jreback commented Apr 8, 2015

ordered=True just means that the categories themselves are in order (and it's meaningful).

A subsetting operation does selection which is ordered by the index of the frame.
You need to do an operation that matters to the actual order, e.g. sort/min/max

In [18]: df.ix[(df.y >= "A") & (df.y <= "C")].sort('y')
Out[18]: 
   x  y
0  0  A
1  1  A
2  2  A
3  3  A
7  7  C
8  8  C
9  9  C
4  4  B
5  5  B
6  6  B
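
The order-sensitive operations mentioned above (sort/min/max) can be sketched as follows. This assumes a recent pandas, where `sort_values` replaces the `.sort` call used in the transcript; the names `ordered_rows`, `smallest` and `largest` are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": np.arange(10), "y": list("AAAABBBCCC")})
df["y"] = pd.Categorical(df["y"], categories=["A", "C", "B"], ordered=True)

# Sorting follows the declared category order A < C < B, not the alphabet
ordered_rows = df.sort_values("y")
print(ordered_rows["y"].tolist())

# min/max also use the category order, so the maximum is "B"
smallest = df["y"].min()
largest = df["y"].max()
print(smallest, largest)
```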

Note that if you had a CategoricalIndex (merging very shortly), then this would work as you expect.

@jreback jreback added Usage Question Categorical Categorical Data Type labels Apr 8, 2015
@jorisvandenbossche
Member

@jreback I don't fully understand your explanation of the behaviour.

I think the point of having an ordered categorical is that this order is used in comparisons ?

@jorisvandenbossche
Member

Relabeled the issue until we agree on that :-)

cc @JanSchulz @shoyer

So the point is this:

In [34]: cat = Series([1, 2, 3]).astype("category", categories=[3, 2, 1], ordered=True)

In [35]: cat
Out[35]:
0    1
1    2
2    3
dtype: category
Categories (3, object): [3 < 2 < 1]

In [37]: cat[[0]]
Out[37]:
0    1
dtype: category
Categories (3, object): [3 < 2 < 1]

In [38]: cat[[0]] > 2
Out[38]:
0    False
dtype: bool

Should the categorical value 1 be larger than 2? (It is larger based on the categorical order; it is not larger based on normal order.)
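
The two possible readings of `cat > 2` can be made explicit. The sketch below is an illustration, not the pandas implementation: it computes the order-based answer from the integer codes and the value-based answer from the plain array, using a `pd.Categorical` constructor rather than the older `astype("category", categories=...)` spelling.

```python
import numpy as np
import pandas as pd

cat = pd.Series(pd.Categorical([1, 2, 3], categories=[3, 2, 1], ordered=True))

# Order-based: compare positions in the category list (3 < 2 < 1)
position_of_2 = cat.cat.categories.get_loc(2)
by_order = (cat.cat.codes > position_of_2).tolist()

# Value-based: compare the underlying integers
by_value = (np.asarray(cat) > 2).tolist()

print(by_order)  # [True, False, False]
print(by_value)  # [False, False, True]
```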

@jorisvandenbossche jorisvandenbossche added this to the 0.16.1 milestone Apr 10, 2015
@jankatins
Contributor

I agree with @jorisvandenbossche: in the last example, the comparison cat[[0]] > 2 should be True and df.y <= "C" should be False for all B in df.y.

Not sure what happened here, but the docs also show this "wrong" behavior: http://pandas.pydata.org/pandas-docs/version/0.15.2/categorical.html#comparisons
-> Both comparisons should have the same output

On the other hand, with the ordered = False default change they should raise an error :-(

@kay1793

kay1793 commented Apr 10, 2015

The implementation of categories in pandas doesn't distinguish a (single) category from its underlying value:

cat = Series([1, 2, 3]).astype("category", categories=[3, 2, 1], ordered=True)
type(cat[0])
Out[7]: numpy.int64
# *not* "pandas.Category(value=1,parent=Foo)"

I expect users sometimes want to subset based on value and other times by order (just as loc/iloc selects based on label or location) but the way a category and its value are lumped into one makes it difficult to support both. If you did have singleton categories as objects, you would need to have them reference the parent set of categories they belong to in order to be meaningful.
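
A short sketch of the distinction described above, assuming a recent pandas: pulling out a single element yields a plain NumPy scalar with no link to its categories, whereas a length-one selection keeps the categorical dtype. The names `scalar` and `single` are illustrative.

```python
import numpy as np
import pandas as pd

cat = pd.Series(pd.Categorical([1, 2, 3], categories=[3, 2, 1], ordered=True))

scalar = cat.iloc[0]    # plain value, the category information is gone
single = cat.iloc[[0]]  # length-1 Series, still categorical

print(type(scalar))
print(single.dtype)
```

This is why the earlier example in the thread uses `cat[[0]]` rather than `cat[0]`.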

@jankatins
Contributor

@kay1793: ok, good catch, this should account for cat[0] > 2, but not for df.y > "C"

@kay1793

kay1793 commented Apr 10, 2015

That's true.

If the cat series comparison adhered to order, users could still get a value based comparison
by "lowering" the category series into a series of values. That's bound to confuse newcomers so hopefully the error message will be informative.

@jorisvandenbossche
Member

@kay1793 Indeed, but that is the limitation of numpy we have to live with at the moment (and that is also the reason I used cat[[0]] in my example ..)

@jorisvandenbossche
Member

@JanSchulz regarding the question on "raising for ordered=False" -> at the moment we said that sorting such a category should work. So maybe it is a bit inconsistent that a direct comparison of greater/smaller than would raise?

@kay1793

kay1793 commented Apr 10, 2015

That's true.

That is, unless you think what olga is asking for should work as:

cats[cats > pd.category(1,parent=<Cats>)]

or similar, rather than forcing anyone who wants cats[cats < 2] to downcast the entire array first in order to compare by value. It's a question of whether you choose the semantics based on the type of the category array or the type of the other operand (when comparing with a category array: 2 < cats and cats > 2).

@jorisvandenbossche
Member

It would indeed be an idea to have some kind of 'categorical singleton' (just a value but with a category dtype attached), however, still, I think comparing with a 'normal' value should always try to see this value as a category.

@JanSchulz about the raising question: R does indeed raise on comparing factors (> not meaningful for factors), and it works for ordered factors as is now proposed here (comparing the order of the categories)

@jankatins
Contributor

@jorisvandenbossche I think sorting and comparing should be different here: I don't mind if my sorting succeeds when it shouldn't but comparing when it isn't comparable is not so nice (I think along this "example": "I can sort blue and green stones, but I can't compare them")
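
The sorting-versus-comparing distinction can be sketched with an unordered categorical. This assumes the behaviour of recent pandas releases, where sorting an unordered categorical succeeds but an inequality comparison raises `TypeError`; the stone colours and variable names are made up for illustration.

```python
import pandas as pd

# Unordered categorical: blue and green have no declared ranking
stones = pd.Series(pd.Categorical(["green", "blue", "green"],
                                  categories=["blue", "green"]))

# Sorting still works and simply follows the category listing
sorted_stones = stones.sort_values().tolist()

# An inequality comparison is refused because no order was declared
try:
    stones < "green"
    comparison_raised = False
except TypeError:
    comparison_raised = True

print(sorted_stones, comparison_raised)
```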

@jankatins
Contributor

I'm currently trying to find the problem. I have a few test cases which currently raise, so let's see...

@jorisvandenbossche
Member

@JanSchulz I think that is a nice analogy for the difference between sorting and comparing! I am convinced :-)

@olgabot
Author

olgabot commented Apr 10, 2015

@JanSchulz If the user has defined how they want blue and green to be sorted, then less than/greater than should make sense :)

@jankatins
Contributor

Yikes:

In[3]: cat = Series([1, 2, 3]).astype("category", categories=[3, 2, 1], ordered=True)
In[4]: cat > 2
Out[4]: 
0    False
1    False
2     True
dtype: bool
In[5]: cat.values > 2
Out[5]: array([ True, False, False], dtype=bool)

@olgabot
Author

olgabot commented Apr 10, 2015

??? how is that possible?

@jankatins
Contributor

There is a values = self.get_values() in https://github.com/pydata/pandas/blob/master/pandas/core/ops.py#L597

After that line, values is an ndarray, which compares the raw values (lexicographically for strings) instead of using the category order.

@jankatins
Contributor

Ok, with #9848 this is now:

In[2]: import pandas as pd
In[3]: import numpy as np
In[4]:  df = pd.DataFrame({'x':np.arange(10), 'y':list('AAAABBBCCC')})
In[5]: df.y = pd.Categorical(df.y, categories=['A', 'C', 'B'], ordered=True)
In[6]: df.ix[(df.y >= "A") & (df.y <= "C")]
Out[6]: 
   x  y
0  0  A
1  1  A
2  2  A
3  3  A
7  7  C
8  8  C
9  9  C

and

In[8]: cat = pd.Series([1, 2, 3]).astype("category", categories=[3, 2, 1], ordered=True)
In[9]: cat[[0]]
Out[9]: 
0    1
dtype: category
Categories (3, int64): [3 < 2 < 1]
In[10]: cat[[0]] > 2
Out[10]: 
0    True
dtype: bool

@jankatins
Contributor

As far as I can understand the codepaths and the output of git blame this error was present since the beginning of "Categoricals as blocks" :-/

@jreback
Contributor

jreback commented Apr 11, 2015

closed by #9848

@jreback jreback closed this as completed Apr 11, 2015
@kay1793

kay1793 commented Apr 12, 2015

Now that the behaviour is changed, what is the officially sanctioned way for doing comparisons based
on the underlying value of categories?

@jorisvandenbossche
Member

What do you mean with the "underlying value"? The behaviour it was before?

@kay1793

kay1793 commented Apr 12, 2015

not the behaviour, but what it enabled. yes, I mean the "label" value.

I'm thinking in particular of having ordinal interval categories in multiple frames, where it makes sense
to subset by value, but categories may vary between frames. For example, heights in 5 cm bins here
and 10 cm bins there, and I wish to find all individuals taller than 181 cm.

@shoyer
Member

shoyer commented Apr 12, 2015

Well, maybe you want to use .codes directly? What is the actual use case?

@kay1793

kay1793 commented Apr 12, 2015

@shoyer, I've already updated the comment with an example; here's some code.
Unfortunately .codes doesn't relate to the label's value; it's only an index into the label array.

In [24]: s = pd.Series([5*i for i in np.random.randint(30, 45, 20)])
    ...: c = pd.Categorical(s, categories=sorted(set(s)), ordered=True)
    ...: n = pd.Series([" ".join([x, y])
    ...:                for x in ["Francois", "Jake", "Cecile", "Mike", "Brittany"]
    ...:                for y in ["Singh", "Cohen", "O'Malley", "Durant"]])
    ...: df = pd.DataFrame(dict(height=c, names=n))
    ...: df
Out[24]: 
   height              names
0     160     Francois Singh
1     200     Francois Cohen
2     200  Francois O'Malley
3     205    Francois Durant
4     175         Jake Singh
5     165         Jake Cohen
6     195      Jake O'Malley
7     195        Jake Durant
8     180       Cecile Singh
9     185       Cecile Cohen
10    190    Cecile O'Malley
11    215      Cecile Durant
12    175         Mike Singh
13    210         Mike Cohen
14    215      Mike O'Malley
15    175        Mike Durant
16    205     Brittany Singh
17    155     Brittany Cohen
18    220  Brittany O'Malley
19    205    Brittany Durant

In [25]: df.height
<...>
Name: height, dtype: category
Categories (13, int64): [155 < 160 < 165 < 175 ... 205 < 210 < 215 < 220]

What is the recommended way to select all people taller than 203 cm (now that df.height > 203 doesn't work)?

@kay1793

kay1793 commented Apr 12, 2015

also, just found that before #9848:

df.height > 10
Out[30]: 
0     True
1     True
2     True
...

and after:

df.height > 10
Out[7]: 
0     False
1     False
2     False
3     False

the second should probably raise with NoSuchCategory. @JanSchulz (pandas: 0.16.0-104-gd27e0a6)

@jorisvandenbossche
Member

Indeed, it is not yet fully as it should be (@kay1793 thanks for pointing that out):

In [5]: s = pd.Series([1, 5, 10]).astype('category', ordered=True)

In [6]: s
Out[6]:
0     1
1     5
2    10
dtype: category
Categories (3, int64): [1 < 5 < 10]

In [7]: s > 5
Out[7]:
0    False
1    False
2     True
dtype: bool

In [8]: s > 4
Out[8]:
0    False
1    False
2    False
dtype: bool

So I think this should either raise (with something like "4 not a category, so cannot compare") or do a correct comparison.
But as we actually cannot say what a correct comparison is (since you can give a specified order (like 1, 10, 5), where should 4 fit in?), I think the only option is to raise?
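
The "raise" option discussed here matches what recent pandas releases do, as far as can be checked: an inequality against a value that is not one of the categories raises `TypeError`, while an equality test simply returns all `False`. A minimal sketch, with illustrative variable names:

```python
import pandas as pd

s = pd.Series(pd.Categorical([1, 5, 10], ordered=True))

# 5 is a category, so the order-based comparison is well defined
gt_five = (s > 5).tolist()

# 4 is not a category: there is no position for it in the order
try:
    s > 4
    raised = False
except TypeError:
    raised = True

# Equality with a non-category is still answerable: nothing matches
eq_four = (s == 4).tolist()

print(gt_five, raised, eq_four)
```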

@jorisvandenbossche
Member

@kay1793 BTW, if you want to compare based on the "label values" (so on the values with the dtype of the categories, not as one of the categories), you can always convert to array and compare then:

In [9]: np.array(s) > 4
Out[9]: array([False,  True,  True], dtype=bool)

@jankatins
Contributor

The s > 4 example should raise. If you want to have that succeed, you need to do s.astype(int) > 4 or np.asarray(s) > 4
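
Both workarounds named above can be sketched side by side. Each one materializes the label values first and then compares them as ordinary integers, so the category order plays no role; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.Categorical([1, 5, 10], ordered=True))

# Option 1: convert the Series back to an integer dtype, then compare
via_astype = (s.astype("int64") > 4).tolist()

# Option 2: convert to a plain NumPy array, then compare
via_asarray = (np.asarray(s) > 4).tolist()

print(via_astype)   # [False, True, True]
print(via_asarray)  # [False, True, True]
```

The `astype` form returns a Series and keeps the index, which is usually more convenient for boolean indexing into a frame.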

@jankatins
Contributor

@kay1793 Ok, wait, I think I already have a fix, which I will push shortly...

@kay1793

kay1793 commented Apr 12, 2015

I see that creating a fully materialized copy of the series is what was happening anyway before #9848. That means you get two copies of the entire series: a scalar series followed by a bool series. The middle step is unnecessary, especially on an ordered series.

@JanSchulz, I don't understand what's going on with equality vs. inequality when comparing with a non-category value #9848 (comment).

@jankatins
Contributor

@kay1793 BTW: the above example is misleading as height is a metric variable and therefore should not be converted to Categorical :-)

@kay1793

kay1793 commented Apr 12, 2015

@JanSchulz , they're quantized into bins so they are both...

@shoyer
Member

shoyer commented Apr 12, 2015

if height is binned, it should probably be saved as intervals... which will need a real interval type to get right :)

@kay1793

kay1793 commented Apr 12, 2015

call it the number of games participated in for every NBA player in history then - discrete but unbinned. details...

@jorisvandenbossche
Member

@kay1793 But is the fact that something is discrete a reason to put it in a categorical?

@shoyer
Member

shoyer commented Apr 12, 2015

call it the number of games participated in for every NBA player in history then - discrete but unbinned. details...

Should be some sort of integer dtype, I think :).

@kay1793

kay1793 commented Apr 12, 2015

if it's discrete and taken from a finite small set I think categories can often make sense. if that ties in nicely to plotting and similar conveniences for example.

@jankatins
Contributor

Ok, please have a look at #9864. Let's see what the tests say...

@kay1793 statistical tests will handle it wrongly, e.g. OLS will probably add a dummy variable for each of the (n-1) categories of height / number of games, and you probably only want it treated as a metric variable.

@kay1793

kay1793 commented Apr 12, 2015

... test-subjects by country (ordered by population size) and I'm interested in all people from countries whose name comes before "Zambia", because the hypothesis is that living in a country whose name begins with z correlates with lower life expectancy (possibly true).

@shoyer
Member

shoyer commented Apr 15, 2015

... test-subjects by country (ordered by population size) and I'm interested in all people from countries whose name comes before "Zambia", because the hypothesis is that living in a country whose name begins with z correlates with lower life expectancy (possibly true).

This seems very hypothetical to me. But you can still do this sort of thing by using categories explicitly, e.g., df[df.country.isin([c for c in df.country.cat.categories if c < 'z'])]

@kay1793

kay1793 commented Apr 15, 2015

df[df.country.isin([c for c in df.country.cat.categories if c < 'z'])]

An issue with this is that it encourages writing non-vectorized code, which should be avoided, as you mentioned in another issue. It also seems very verbose to me.

@jorisvandenbossche
Member

you can also do np.array(df.country) < 'Z' if you want to do a comparison on the raw values.

That is a trade-off you have to consider when deciding whether to use a categorical type. It provides some nice features, but it also has the consequence that the data is no longer regarded as just plain values.

@shoyer
Member

shoyer commented Apr 15, 2015

Well, there shouldn't be too many categories in most cases, but it would also work as a vectorized operation, just a little more verbose:

cats = df.country.cat.categories
df[df.country.isin(cats[cats < 'z'])]
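
A runnable version of the two-line snippet above, on a small made-up frame. The country list, the `age` column and the uppercase `"Z"` threshold are assumptions for illustration (with capitalized names, a lowercase `'z'` would match every country).

```python
import pandas as pd

# Categories are deliberately not alphabetical (e.g. ordered by population)
df = pd.DataFrame({
    "country": pd.Categorical(
        ["China", "Zambia", "India", "Zimbabwe", "Brazil"],
        categories=["China", "India", "Brazil", "Zambia", "Zimbabwe"],
        ordered=True,
    ),
    "age": [34, 51, 28, 45, 39],
})

# Compare the (few) category labels as strings, once per category
cats = df["country"].cat.categories
before_z = cats[cats < "Z"]

# Then select rows whose category is among the matching labels
subset = df[df["country"].isin(before_z)]
print(subset)
```

The string comparison runs over the categories rather than over every row, which keeps it cheap when the number of categories is small.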

@kay1793

kay1793 commented Apr 15, 2015

Yeah, that's probably something for pandas-ply to use as another example for "pandas code is ugly" :).

In our opinion, this pandas-ply code is cleaner, more expressive, more readable, 
more concise, and less error-prone than the original pandas code.

joris, the thing is that this compromise is purely accidental. If you had some way (by use of types, probably) of asking for that behavior, __lt__ knows that it is attached to a categorical type and could lookup the label value associated with the catcode as it goes along. The user code would be more concise and there'd be no need to materialize a full copy of the de-factorized array (which is mem-costly, for string labels).

But both solutions work, that's true enough.

@kay1793

kay1793 commented Apr 15, 2015

once people begin to create packages solely to avoid what is considered "idiomatic" code... might be a good idea to take notice.

@jorisvandenbossche
Member

Why is this compromise accidental?
Say I have a categorical that has categories ["low", "middle", "high"], I want < "high" to work on the categories order, which just conflicts with the type of comparison you want to make. So you have to make a choice here I think?

Even if there were some kind of categorical singleton, so that you could explicitly compare to a category (pseudo code: < category('high')) and __lt__ could know it is comparing against a category and use the categorical order, I would argue that the plain string comparison (< 'high') should also do this and not fall back to plain non-categorical comparison, as that would only cause confusion and bugs.

@kay1793

kay1793 commented Apr 15, 2015

There's no reason not to support both concisely but only one of them is - in that sense it is accidental. The confusion you mention seems very hypothetical to me. That would only happen if you tried to guess what the user wants. Being explicit about it should work ok: see .ix vs .loc/.iloc.

Having a categorical singleton wouldn't help at this point because you just made [cat < 6] (instead of [cat < Cat(6)]) mean compare by cat-order forever. You simply had one syntactic "slot" to fill and chose to throw away one and keep the other, but both are useful.

Why isn't there for example an .astype('object|int|S16') attr on category series? it wouldn't be efficient but at least it'd be easy.

df[df.country.isin([c for c in df.country.cat.categories if c < 'z'])]
vs.
df[df.country.astype('object') < 'z']

a little shorter and much clearer.

@jorisvandenbossche
Member

There is:

In [6]: s = pd.Series([1, 2, 3, 1, 2, 1], dtype='category')

In [7]: s
Out[7]:
0    1
1    2
2    3
3    1
4    2
5    1
dtype: category
Categories (3, int64): [1, 2, 3]

In [8]: s.astype('int64')
Out[8]:
0    1
1    2
2    3
3    1
4    2
5    1
dtype: int64

same works with astype(str) for string categories.
But this is more or less the same as the np.array(s) I suggested (df[np.asarray(df.country) < 'z']). It is indeed maybe a bit more idiomatic, so maybe we should mention that in the docs as well.

There's no reason not to support both concisely

Do you have a suggestion of an interface?

@kay1793

kay1793 commented Apr 15, 2015

That's strange.

In [6]: df = pd.DataFrame({'x':np.arange(10), 'y':list('AAAABBBCCC')})
In [7]: foo = pd.Categorical(df.y, categories=['A', 'C', 'B'], ordered=True)
In [8]: df.y = foo

now foo doesn't have astype but df.y does. I understand why but that's a quirk.

Anyway, you're right for series with cat dtype astype is there and that's idiomatic and easy enough.

Do you have a suggestion of an interface?

You could have a proxy (available as an attr) that implemented the magic methods by looking up the value associated with the code invisibly. To be honest I'd just use astype because it would be a lot of work and I'm not convinced it matters enough in terms of performance to do it.

just call astype the recommended way.
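
For completeness, the "look up the value associated with the code" idea can be approximated without materializing the labels for every row. The helper below, `compare_labels`, is hypothetical and not part of pandas: it compares each category label once and then maps the per-category result back through the integer codes.

```python
import operator

import numpy as np
import pandas as pd


def compare_labels(cat_series, op, value):
    """Hypothetical helper: value-based comparison on a categorical Series."""
    cats = cat_series.cat.categories
    codes = cat_series.cat.codes.to_numpy()
    # One comparison per category instead of one per row
    per_category = np.asarray(op(cats, value), dtype=bool)
    result = np.zeros(len(codes), dtype=bool)
    valid = codes >= 0  # code -1 marks a missing value; leave it False
    result[valid] = per_category[codes[valid]]
    return pd.Series(result, index=cat_series.index)


heights = pd.Series(pd.Categorical([160, 200, 205, 175],
                                   categories=[160, 175, 200, 205],
                                   ordered=True))
taller = compare_labels(heights, operator.gt, 203)
print(taller.tolist())  # [False, False, True, False]
```

On this data it gives the same answer as `heights.astype("int64") > 203`, which remains the simpler choice unless the labels are large strings and memory matters.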


6 participants