Irregular errors when reading certain categorical strings from hdf #10366
On Linux/Python 2.7/pandas: 0.16.2-9-g7636c2c (master) I get
What does |
@bashtage Sorry, I forgot to mention any version or platform information. I am on OSX and am getting the errors only in python3 (python2 behaves differently). I was on master but went back to the latest release. Notice the "-coding-" EDIT in the example I gave above. It changes the look of the output slightly. Here is what I just tested now:
So this is a bug here.
@jreback Thanks. I think I can get to this in a week or so.
@cottrell gr8!
It seems feeding the encoding and nan_rep through only fixed some of the errors. Basically, it looks like the categorical metadata is being mapped to "nan" for anything with non-standard encodings. Any suggestions on how to check whether the problem is with the writing or the reading of the hdf store? I can open the hdf store using pytables directly and I think the relevant node is '/data/meta/values/meta' where "data" is my top level key.
Hmm, you will need to encode the nan string as well. When reading, everything needs to be decoded, and then the categorical created.
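One way to read that suggestion, as a stdlib-only sketch rather than the actual pandas code path: encode the nan marker with the same encoding as the stored values, compare on bytes, and decode everything else before any categorical is built. The helper name `decode_category_values`, its parameters, and the sample bytes are all hypothetical and chosen only for illustration.

```python
import math


def decode_category_values(raw_values, encoding="utf-8", nan_rep="nan"):
    """Hypothetical helper: decode stored byte strings into category values.

    The nan marker is encoded with the same encoding as the data, so the
    comparison happens bytes-to-bytes and non-ASCII values are not
    mistaken for missing values.
    """
    nan_bytes = nan_rep.encode(encoding)
    categories = []
    for raw in raw_values:
        if raw == nan_bytes:
            # Only an exact match with the encoded marker means "missing".
            categories.append(math.nan)
        else:
            categories.append(raw.decode(encoding))
    return categories


# Illustrative sample bytes: an empty string, a non-ASCII value, the nan marker.
values = decode_category_values([b"", b"E\xc3\x89, 17", b"nan"])
print(values)
```

In this sketch the empty string and the non-ASCII value both survive as distinct categories, and only the exact marker becomes a missing value.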
Posting some notes here as I go. https://github.com/pydata/pandas/blob/master/pandas/io/pytables.py#L4408 seems to be turning different encodings into nan. Commenting it out resolves the uniqueness exceptions, but the encodings are still not quite right. So it looks to me like the writing (to_hdf) is possibly OK:

In [94]: import tables
In [95]: f = tables.open_file('testhdf.h5', 'r')
In [96]: for r in f.root.data.meta.values.meta.table:
   ....:     print(r['index'], r['values'])
   ....:
0 b''
1 b'E\xc3\x89, 17'
2 b'a'
3 b'b'
4 b'c'
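The bytes in the listing above can be checked without opening the HDF file. This is a small sketch, assuming the store was written with UTF-8 (the thread does not state the encoding explicitly); it only shows that the stored values decode cleanly and stay unique, which is consistent with the write side being fine.

```python
# Raw values copied from the meta table listing above.
raw_values = [b"", b"E\xc3\x89, 17", b"a", b"b", b"c"]

# Assumption: the values were encoded as UTF-8 when written.
decoded = [raw.decode("utf-8") for raw in raw_values]

# Categories must be unique; check that decoding does not collapse any of them.
all_unique = len(set(decoded)) == len(decoded)

print(decoded)
print("unique:", all_unique)
```

If the decoded values are distinct here, a uniqueness error on read would point at the read path (for example, several values being mapped to nan) rather than at the stored data.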
…f serialization.
It seems that there is something bad happening when we use certain strings with special characters AND the empty string with categoricals:
Results in:
Not sure if I am using this incorrectly or if this is actually a corner case.