Categorical[idx].codes != Categorical.codes[idx] #9469

Closed
jseabold opened this issue Feb 11, 2015 · 15 comments · Fixed by #9470
Labels
Bug Categorical Categorical Data Type
Milestone

Comments

@jseabold
Contributor

I'm unable to isolate this, but maybe this rings a bell for someone. It took me a long time to find the source of the error in my code. I don't think this is intentional, and if it is, it's pretty dangerous.

[~/]
[22]: y[idx].codes[:15]
[22]: array([4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 1, 4], dtype=int8)

[~/]
[23]: y.codes[idx][:15]
[23]: array([4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4], dtype=int8)

[~/]
[24]: idx[:15]
[24]:
array([ 74172,  53885, 128953,  28125, 136548, 136700,  93633,  61147,
        56535,  90577, 115719,  58038, 111711,  53399,  77475])

[~/]
[25]: type(y)
[25]: pandas.core.categorical.Categorical

[~/]
[27]: pd.version.version
[27]: '0.15.2'
@jseabold
Contributor Author

It seems that Categorical._maybe_coerce_indexer is transforming the int64 index to int8.

@jreback
Contributor

jreback commented Feb 11, 2015

The indexer is coerced to the same dtype as the codes (int8 in this case). This is necessary for the indexing to work properly (otherwise numpy complains). But I think that if the indexer is out of bounds for that dtype, it will wrap around, which is probably not what is expected.

Can you show a construction for y?
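
The wrap-around described above can be sketched with plain NumPy. This is only an illustration of what an int64-to-int8 cast does to large positions, not the actual pandas code path; the `data` array is a hypothetical stand-in whose values equal their positions so the selected elements are easy to read.

```python
import numpy as np

# Hypothetical stand-in array: each value equals its own position.
data = np.arange(150000)

# Positions as int64, two of them taken from the report above.
idx = np.array([74172, 53885, 100000], dtype=np.int64)

# Casting to int8 keeps only the low 8 bits, so large positions wrap.
wrapped = idx.astype(np.int8)
print(wrapped)        # [-68 125 -96]

# Indexing with the wrapped values silently selects other elements;
# negative values count from the end of the array.
print(data[idx])      # [ 74172  53885 100000]
print(data[wrapped])  # [149932    125 149904]
```

No error is raised at any point, which matches the silent mismatch reported in this issue.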

@jreback jreback added Categorical Categorical Data Type Bug labels Feb 11, 2015
@jreback jreback added this to the 0.16.0 milestone Feb 11, 2015
@jreback
Contributor

jreback commented Feb 11, 2015

Needless to say, this is probably not tested well, and probably not used very much, as this is almost always used from a Series. (This is somewhat legacy code.)

@jseabold
Contributor Author

I'm trying to make a MWE.

@jseabold
Contributor Author

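import numpy as np
import pandas as pd

np.random.seed(1)
c = pd.Categorical(np.random.randint(0, 5, size=150000).astype(np.int8))
idx = np.array([100000]).astype(np.int64)
c.codes[idx]  # index the codes directly
c[idx].codes  # index the Categorical, then take the codes; can differ on 0.15.2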

@jseabold
Contributor Author

In my case, y was created from some real data with 5 categories, and idx was created from sklearn cross-validation functions. Given the size of my y, the indices are int64s.

@jseabold
Contributor Author

It's not obvious to me why numpy would care about a difference between the dtype of the indexer and the dtype of the array. AFAIK, the indexer just needs to be any integer type or boolean, no?
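
A small sketch of that point, using only NumPy: an int8 array is indexed with integer indexers of wider dtypes and with a boolean mask. The array contents and the names `codes`, `results`, and `mask` are made up for illustration.

```python
import numpy as np

# An int8 array, similar in dtype to the codes of a small Categorical.
codes = np.zeros(150000, dtype=np.int8)
codes[100000] = 3

# Integer indexers of wider dtypes are accepted as-is.
results = {}
for dtype in (np.int32, np.int64):
    indexer = np.array([100000], dtype=dtype)
    results[np.dtype(dtype).name] = codes[indexer]

# A boolean mask of the same length works as well.
mask = np.zeros(codes.shape, dtype=bool)
mask[100000] = True
masked = codes[mask]

print(results)
print(masked)
```

In each case the result keeps the int8 dtype of the array being indexed; the indexer's dtype does not need to match it.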

@jseabold
Contributor Author

After removing the casting, everything seems to work fine for me.

[4]: np.version.version
[4]: '1.10.0.dev+4ed1587'

@jreback
Contributor

jreback commented Feb 11, 2015

fixed in #9470 (just had to remove coercion), but didn't work for both setting and getting.

IIRC this was a problem on Windows, as it complained when trying to index (but maybe I am remembering incorrectly). I agree it should work.

If you can test on Windows, that would be great; otherwise I can do it next week.
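
A platform check of the kind requested here might look like the sketch below. It assumes a pandas build that includes the change from #9470; the variable names and the choice of 1000 random positions are illustrative only.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(1)

# A Categorical with 5 categories, so its codes are stored as int8.
c = pd.Categorical(rng.randint(0, 5, size=150000).astype(np.int8))

# int64 positions, many of them far outside the int8 range.
idx = rng.randint(0, len(c), size=1000).astype(np.int64)

# Both orders of operation should select the same codes.
via_categorical = c[idx].codes
via_codes = c.codes[idx]
consistent = np.array_equal(via_categorical, via_codes)
print(consistent)
```

If `consistent` comes out False, the indexer is probably still being coerced somewhere along the way.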

@jseabold
Contributor Author

I'll see if I can find a Windows box around. If not, I can look into it one night. Is there a minimum numpy version?

@jreback
Contributor

jreback commented Feb 11, 2015

numpy 1.7 is our min as of 0.15.0

@jorisvandenbossche
Member

I tested it on Windows, and at least with Windows 7 64-bit, Python 2.7 and numpy 1.9, this works (and the bug is fixed).

@jseabold
Contributor Author

Fantastic. Thanks @jorisvandenbossche

Is this worth a 0.15.3? This is pretty serious IMO.

@jreback
Contributor

jreback commented Feb 12, 2015

Well, considering it wasn't noticed until 3 months after 0.15.2, no.

0.16.0 is scheduled somewhat soon anyhow.

@jseabold
Contributor Author

Well, I lost about 2 weeks of work because I was silently working with garbage.
