BUG: when reading an HDF5 file with an encoding, default to using this encoding #11126
Comments
I am investigating this since it has been reported on the PyTables mailing list too. It seems that specifying the encoding doesn't help much.
In [29]: store = pd.HDFStore('feeds.h5', mode='w')
In [30]: store.append('feeds', feeds_series[84:86], min_itemsize=200, encoding='utf-8')
In [31]: store.flush()
In [32]: store.select('feeds',encoding='utf-8')
Out[32]:
84
85 April 2014. Blir MDG det nye arbeider @partiet...
Name: feeds, dtype: object
In [33]: feeds_series[84:86]
Out[33]:
84 De statsbærende partiene Ap og Høyre må ta sky...
85 April 2014. Blir MDG det nye arbeider @partiet...
Name: feeds, dtype: object |
It's getting more interesting. If you save only row 84 it works, and the same goes for saving rows 84, 85 and 86. Only when you save just rows 84 and 85 does row 84 disappear.
In [41]: store = pd.HDFStore('feeds.h5', mode='w')
In [42]: store.append('a', feeds_series[84:85], min_itemsize=200, encoding='utf-8')
In [43]: store['a']
Out[43]:
84 De statsbærende partiene Ap og Høyre må ta sky...
Name: feeds, dtype: object
In [44]: store.append('b', feeds_series[84:86], min_itemsize=200, encoding='utf-8')
In [45]: store['b']
Out[45]:
84
85 April 2014. Blir MDG det nye arbeider @partiet...
Name: feeds, dtype: object
In [46]: store.append('c', feeds_series[84:87], min_itemsize=200, encoding='utf-8')
In [47]: store['c']
Out[47]:
84 De statsbærende partiene Ap og Høyre må ta sky...
85 April 2014. Blir MDG det nye arbeider @partiet...
86 MDG: Hasj for kjøtt. #valg2015
Name: feeds, dtype: object |
@andreabedini This has been fixed on master I think. |
@kawochen ah cool, I can stop debugging then :P pointer to the solution? |
@andreabedini I'm not sure where it was fixed, but the bug (about line 84) is reproducible with 0.16.2 but not with master. |
We were not properly decoding (see here). So the problem from the OP that started this (on SO and the PyTables mailing list) is fixed in master by the above, if you specify the encoding. THIS issue is about using the encoding that is already recorded in the metadata by default. |
On the duplicate bug I reported, @jreback said "you need to pass the encoding when reading". How do I pass the encoding? I don't see that as an option in http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_hdf.html . |
Yeah, it's missing from the docstring; that should be addressed in the xref issue.
Pass encoding='utf8' or whatever you passed when you created it. |
Sorry, that didn't work. I even tried re-saving it with encoding='utf-8'. The bytes of text saved in the file are indeed UTF-8, but they're still getting read as the empty string. |
@jreback To illustrate the problem:
|
@rspeer can you give a complete copy pastable example (e.g. need the frame creation in the first place), as well as |
I'll work on making a small example (I'm certainly not going to ask you to obtain all the data I'm using). I already reported |
Wait. Okay, it was a good thing you asked about the versions. The problem is that I was running on the full dataset on a different machine that was on pandas 0.16. I'll let you know if upgrading fixes it. |
OK, why don't you report it in that issue and I will reopen it. |
Sorry, I forgot about this issue, and it seems to be fixed. |
@rspeer do you have a short example which didn't work and now does (that we can use as a test)? |
It sounds like this issue has been resolved for the other commenters, but unfortunately we have no reproducible example. For now we can close this issue and reopen it if we encounter the problem again.
http://stackoverflow.com/questions/32553207/values-missing-when-loaded-from-pandas-hdf5-file/32587108?noredirect=1#comment53034458_32587108
e.g.
if I
then
should pick up the encoding for each table (unless the user overrides by passing in
encoding=...)
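The precedence requested here can be sketched in plain Python. This is only an illustration of the fallback order (explicit override first, then the encoding recorded with the table, then a default), not pandas' actual implementation. The names `resolve_encoding`, `decode_column`, `DEFAULT_ENCODING` and the `table` dict are hypothetical and do not come from pandas or PyTables.

```python
from typing import List, Optional

# Assumed fallback for this sketch only; not taken from pandas.
DEFAULT_ENCODING = "utf-8"


def resolve_encoding(stored: Optional[str], override: Optional[str] = None) -> str:
    """Choose the encoding used to decode a table's byte strings.

    Precedence: explicit user override, then the encoding recorded in the
    table's metadata at write time, then the default.
    """
    if override is not None:
        return override
    if stored is not None:
        return stored
    return DEFAULT_ENCODING


def decode_column(raw: List[bytes], stored: Optional[str],
                  override: Optional[str] = None) -> List[str]:
    """Decode every byte string in a column with the resolved encoding."""
    encoding = resolve_encoding(stored, override)
    return [value.decode(encoding) for value in raw]


# Hypothetical table: raw bytes plus the encoding recorded when it was written.
table = {"encoding": "utf-8", "values": ["Høyre".encode("utf-8"), b"MDG"]}

# No override is passed, so the recorded encoding is used.
decoded = decode_column(table["values"], stored=table["encoding"])
```

With this ordering, a reader that passes nothing gets the encoding the writer used, while an explicit `encoding=` argument still wins.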