BUG: when reading an HDF5 file with an encoding, default to using this encoding #11126

Closed
jreback opened this issue Sep 16, 2015 · 18 comments
Labels
Bug IO HDF5 read_hdf, HDFStore Unicode Unicode strings

Comments

@jreback
Contributor

jreback commented Sep 16, 2015

http://stackoverflow.com/questions/32553207/values-missing-when-loaded-from-pandas-hdf5-file/32587108?noredirect=1#comment53034458_32587108

e.g.

if I

df.to_hdf('foo.h5','df',encoding='utf-8')

then

pd.read_hdf('foo.h5','df')

should pick up the encoding recorded for each table (unless the user overrides it by passing encoding=...).
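A minimal sketch of the requested precedence, in plain Python rather than pandas code. The names `resolve_encoding`, `stored_attrs` and `FALLBACK_ENCODING` are hypothetical and only illustrate the rule; they are not pandas internals, and the fallback value is an assumption.

```python
from typing import Mapping, Optional

# Assumed fallback for illustration only; not a documented pandas default.
FALLBACK_ENCODING = "utf-8"


def resolve_encoding(
    requested: Optional[str],
    stored_attrs: Mapping[str, str],
    fallback: str = FALLBACK_ENCODING,
) -> str:
    """Choose the encoding to decode with.

    Precedence: an explicit argument from the caller, then the encoding
    recorded with the table when it was written, then the fallback.
    """
    if requested is not None:
        return requested
    return stored_attrs.get("encoding", fallback)


# Stand-in for per-table metadata written at save time (hypothetical layout).
table_attrs = {"encoding": "latin-1"}

# Bytes as they might sit on disk for a table written with latin-1.
raw = "Høyre".encode("latin-1")

# No encoding passed on read: the stored one is picked up.
decoded = raw.decode(resolve_encoding(None, table_attrs))
print(decoded)
```

The point of the ordering is that a reader who passes nothing still gets the encoding used at write time, while an explicit `encoding=` keeps working as an override.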

@jreback jreback added Bug Unicode Unicode strings IO HDF5 read_hdf, HDFStore labels Sep 16, 2015
@jreback jreback added this to the Next Major Release milestone Sep 16, 2015
@andreabedini
Contributor

I am investigating this since it has been reported on the PyTables mailing list too. It seems that specifying the encoding doesn't help much.

In [29]: store = pd.HDFStore('feeds.h5', mode='w')

In [30]: store.append('feeds', feeds_series[84:86], min_itemsize=200, encoding='utf-8')

In [31]: store.flush()

In [32]: store.select('feeds',encoding='utf-8')
Out[32]: 
84                                                     
85    April 2014. Blir MDG det nye arbeider @partiet...
Name: feeds, dtype: object

In [33]: feeds_series[84:86]
Out[33]: 
84    De statsbærende partiene Ap og Høyre  ta sky...
85    April 2014. Blir MDG det nye arbeider @partiet...
Name: feeds, dtype: object

@andreabedini
Contributor

It's getting more interesting. If you save only row 84 it works, and the same holds if you save rows 84, 85 and 86. Only when you save rows 84 and 85 together does row 84 disappear.

In [41]: store = pd.HDFStore('feeds.h5', mode='w')

In [42]: store.append('a', feeds_series[84:85], min_itemsize=200, encoding='utf-8')

In [43]: store['a']
Out[43]: 
84    De statsbærende partiene Ap og Høyre  ta sky...
Name: feeds, dtype: object

In [44]: store.append('b', feeds_series[84:86], min_itemsize=200, encoding='utf-8')

In [45]: store['b']
Out[45]: 
84                                                     
85    April 2014. Blir MDG det nye arbeider @partiet...
Name: feeds, dtype: object

In [46]: store.append('c', feeds_series[84:87], min_itemsize=200, encoding='utf-8')

In [47]: store['c']
Out[47]: 
84    De statsbærende partiene Ap og Høyre  ta sky...
85    April 2014. Blir MDG det nye arbeider @partiet...
86                       MDG: Hasj for kjøtt. #valg2015
Name: feeds, dtype: object

@kawochen
Contributor

@andreabedini This has been fixed on master I think.

@andreabedini
Contributor

@kawochen ah cool, I can stop debugging then :P pointer to the solution?

@kawochen
Contributor

@andreabedini I'm not sure where it was fixed, but the bug (about line 84) is reproducible with 0.16.2 but not with master.

@jreback
Contributor Author

jreback commented Sep 18, 2015

We were not properly decoding, see here

So the original report that started this (on SO and the PyTables mailing list) is fixed in master by the above, provided you specify the encoding. THIS issue is about defaulting to the encoding that is already recorded in the metadata.

@rspeer

rspeer commented Feb 12, 2016

On the duplicate bug I reported, @jreback said "you need to pass the encoding when reading".

How do I pass the encoding? I don't see that as an option in http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_hdf.html .

@jreback
Contributor Author

jreback commented Feb 12, 2016

Yeah, it's missing from the docstring; that should be addressed in the xref issue.

@jreback
Contributor Author

jreback commented Feb 12, 2016

encoding='utf8', or whatever you passed when you created it

@rspeer

rspeer commented Feb 12, 2016

Sorry, that didn't work. I even tried re-saving it with encoding='utf-8'. The bytes of text saved in the file are indeed UTF-8, but they're still getting read as the empty string.

@rspeer

rspeer commented Feb 12, 2016

@jreback To illustrate the problem:

>>> frame.index.has_duplicates
False
>>> frame.index.get_loc('café')
8240
>>> frame.to_hdf('/data/word2vec.h5', 'mat', encoding='utf-8')
>>> frame2 = pd.read_hdf('/data/word2vec.h5', 'mat', encoding='utf-8')
>>> frame2.index.has_duplicates
True
>>> frame2.index.get_duplicates()
['']
>>> frame2.index.get_loc('café')
...
KeyError: 'café'
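A small way to pin down which labels a round trip damaged is to compare the labels before and after, position by position. This is only a sketch in plain Python: `changed_labels` is a hypothetical helper, and the `reloaded` list is made-up sample data that imitates the symptom above (non-ASCII labels coming back as empty strings), not output from a real file.

```python
from typing import List, Sequence, Tuple


def changed_labels(
    before: Sequence[str], after: Sequence[str]
) -> List[Tuple[int, str, str]]:
    """Return (position, original, reloaded) for every label that changed."""
    if len(before) != len(after):
        raise ValueError("label sequences differ in length")
    return [
        (i, old, new)
        for i, (old, new) in enumerate(zip(before, after))
        if old != new
    ]


# Made-up sample data imitating the reported symptom.
original = ["cafe", "café", "naïve"]
reloaded = ["cafe", "", ""]

lost = changed_labels(original, reloaded)
for position, old, new in lost:
    print(position, repr(old), "->", repr(new))
```

With real frames, the two sequences would be something like `list(frame.index)` and `list(frame2.index)`; a list of changed positions is easier to attach to a bug report than the full dataset.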

@jreback
Contributor Author

jreback commented Feb 12, 2016

@rspeer can you give a complete copy-pastable example (i.e. including the frame creation in the first place), as well as the output of pd.show_versions()?

@rspeer

rspeer commented Feb 12, 2016

I'll work on making a small example (I'm certainly not going to ask you to obtain all the data I'm using). I already reported pd.show_versions() in #12304, and I suggest that bug should be reopened because the title of this bug doesn't describe my problem.

@rspeer

rspeer commented Feb 12, 2016

Wait. Okay, it was a good thing you asked about the versions. The problem is that I was running on the full dataset on a different machine that was on pandas 0.16. I'll let you know if upgrading fixes it.

@jreback
Contributor Author

jreback commented Feb 12, 2016

OK, why don't you report it in that issue and I will reopen it.

@rspeer

rspeer commented Mar 14, 2017

Sorry, I forgot about this issue, and it seems to be fixed.

@jreback
Contributor Author

jreback commented Mar 14, 2017

@rspeer do you have a short example which didn't work and now does (that we can use as a test)?

@mroeschke
Member

It generally sounds like this issue was resolved according to the other commenters, but unfortunately without a reproducible example. For now we can close this issue and reopen it if we encounter this again.
