Categorical[idx].codes != Categorical.codes[idx] #9469

jseabold · 2015-02-11T20:53:55Z

I'm unable to isolate this, but maybe this rings a bell for someone. It took me a long time to find the source of the error in my code. I don't think this is intentional, and if it is, it's pretty dangerous.

[~/]
[22]: y[idx].codes[:15]
[22]: array([4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 1, 4], dtype=int8)

[~/]
[23]: y.codes[idx][:15]
[23]: array([4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4], dtype=int8)

[~/]
[24]: idx[:15]
[24]:
array([ 74172,  53885, 128953,  28125, 136548, 136700,  93633,  61147,
        56535,  90577, 115719,  58038, 111711,  53399,  77475])

[~/]
[25]: type(y)
[25]: pandas.core.categorical.Categorical

[~/]
[27]: pd.version.version
[27]: '0.15.2'

jseabold · 2015-02-11T21:02:13Z

It seems that Categorical._maybe_coerce_indexer is transforming the int64 index to int8.

jreback · 2015-02-11T21:02:19Z

The indexer is coerced to the same dtype as the codes (int8) in this case. This is necessary to have the indexing work properly (otherwise numpy complains). But I think if the indexer is out of bounds it will wrap around which is prob not what is expected.
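A minimal sketch of the wrap-around described above, using only NumPy rather than the pandas internals. The `codes` array and the two index values are illustrative assumptions (the values are taken from the session and MWE in this thread); the cast stands in for what the coercion does, not for the actual pandas code path.

```python
import numpy as np

# Illustrative int8 codes for 150000 observations in 5 categories.
rng = np.random.RandomState(1)
codes = rng.randint(0, 5, size=150000).astype(np.int8)

# Positional indexer whose values do not fit in int8 (range -128..127).
idx = np.array([100000, 53399], dtype=np.int64)

# Casting the indexer to the dtype of the codes silently wraps the values.
coerced = idx.astype(codes.dtype)
print(coerced)  # [ -96 -105]

# The wrapped values are valid negative indices, so no error is raised;
# the lookup just reads positions counted from the end of the array.
from_coerced = codes[coerced]
from_original = codes[idx]
print(from_coerced, from_original)
```

Because the wrapped indices stay in bounds, nothing fails loudly: the result has the right shape and dtype but comes from different positions.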

can you show a construction for y?

jreback · 2015-02-11T21:03:24Z

needless to say this is prob not tested well. and prob not used very much as this is almost always used from a Series. (this is somewhat legacy code)

jseabold · 2015-02-11T21:04:44Z

I'm trying to make a MWE.

jseabold · 2015-02-11T21:07:17Z

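import numpy as np
import pandas as pd
np.random.seed(1)
c = pd.Categorical(np.random.randint(0, 5, size=150000).astype(np.int8))
# Index the underlying codes array directly.
c.codes[np.array([100000]).astype(np.int64)]
# Index the Categorical first, then take codes; on 0.15.2 this can disagree with the line above.
c[np.array([100000]).astype(np.int64)].codes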

jseabold · 2015-02-11T21:09:09Z

In my case, y was created from some real data with 5 categories, and idx was created from sklearn cross-validation functions. Given the size of my y, the indices are int64s.

jseabold · 2015-02-11T21:13:39Z

Not obvious to me why numpy would care between dtype of indexer and the dtype of the array. AFAIK, it just needs to be any integer type or boolean, no?
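As a quick check of that point, here is a small sketch (plain NumPy, with made-up values) that indexes an int8 array using integer indexers of several widths and a boolean mask. It only covers the signed integer dtypes listed; it is not a statement about every platform or NumPy version.

```python
import numpy as np

values = np.array([10, 20, 30, 40, 50, 60], dtype=np.int8)

# Index the same int8 array with signed integer indexers of different widths.
results = {}
for dt in (np.int8, np.int16, np.int32, np.int64):
    indexer = np.array([0, 2, 5], dtype=dt)
    results[np.dtype(dt).name] = values[indexer].tolist()
print(results)

# A boolean mask selecting the same positions.
mask = np.array([True, False, True, False, False, True])
masked = values[mask].tolist()
print(masked)
```

The indexer dtype does not need to match the dtype of the array being indexed; it only needs to be able to represent the positions.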

jseabold · 2015-02-11T21:29:35Z

After removing the casting, everything seems to work fine for me.

[4]: np.version.version
[4]: '1.10.0.dev+4ed1587'

jreback · 2015-02-11T21:34:13Z

fixed in #9470 (just had to remove coercion), but didn't work for both setting and getting.

IIRC this was a problem on windows as it complained when trying to index (but maybe I am remembering incorrectly). I agree it should work.

If you can test on windows would be great, otherwise I can do next week.

jseabold · 2015-02-11T21:36:35Z

I'll see if I can't find a windows box around. If not, I can look into it one night. Any min. numpy version?

jreback · 2015-02-11T21:42:38Z

numpy 1.7 is our min as of 0.15.0

jorisvandenbossche · 2015-02-12T14:42:24Z

I tested it on Windows, and at least with Windows 7, 64bit, python 2.7 and numpy 1.9 this works (and the bug is fixed)

jseabold · 2015-02-12T17:50:58Z

Fantastic. Thanks @jorisvandenbossche

Is this worth a 0.15.3? This is pretty serious IMO.

jreback · 2015-02-12T18:00:09Z

well considering it wasn't noticed till 3 mo after 0.15.2, no

0.16.0 is scheduled somewhat soon anyhow

jseabold · 2015-02-12T20:15:50Z

Well, I lost about 2 weeks of work bc. I was silently working with garbage.

jreback added the Categorical (Categorical Data Type) and Bug labels Feb 11, 2015

jreback added this to the 0.16.0 milestone Feb 11, 2015

jreback mentioned this issue Feb 11, 2015

BUG: Bug in Categorical.__getitem__/__setitem__ with listlike input getting incorrect result from indexer coercion (GH9469) #9470

Merged

jreback closed this as completed in #9470 Feb 12, 2015

jreback mentioned this issue Feb 25, 2015

Setting a value in a categorical column using boolean indexing fails for large index values #9554

Closed

jseabold mentioned this issue Jun 19, 2015

Fix pandas version fonnesbeck/ngcm_pandas_course#8

Merged
