DataFrame.to_hdf will write Unicode labels, but pd.read_hdf won't read them #12304

Closed
rspeer opened this issue Feb 12, 2016 · 3 comments
Labels
IO HDF5 (read_hdf, HDFStore), Unicode (Unicode strings)

Comments

rspeer commented Feb 12, 2016

I've been having a very frustrating problem where I could save a DataFrame with unique labels to HDF5, and load it again, and find that the loaded version had non-unique labels. It turns out this is because pd.read_hdf is replacing all labels containing non-ASCII characters with the empty string. For example, a row with the label 'café' will have the label '' when it is loaded again.

I looked at the bytes that were written in the HDF5 file, and confirmed that the proper UTF-8 text was in there, so the bug is in read_hdf.
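A byte-level check of this kind can be approximated without any HDF5 tooling. The following is a minimal sketch, not the exact procedure used here: it searches a file's raw bytes for the UTF-8 encoding of a label. The helper name `file_contains_text` and the stand-in file contents are assumptions for illustration, and the approach only works if the strings are stored uncompressed.

```python
import os
import tempfile


def file_contains_text(path, text, encoding="utf-8"):
    """Return True if the encoded form of `text` appears in the file's raw bytes."""
    needle = text.encode(encoding)
    with open(path, "rb") as f:
        return needle in f.read()


# Stand-in for a real HDF5 file: arbitrary binary data with a UTF-8 label inside.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x89HDF\r\n\x1a\n" + "café".encode("utf-8") + b"\x00\x00")
    demo_path = tmp.name

found = file_contains_text(demo_path, "café")
os.remove(demo_path)
print(found)
```

If the label is found in the file but missing after loading, the problem is more likely on the reading side than the writing side.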

Version information:

>>> pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.5.0.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-51-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.17.1
nose: 1.3.7
pip: 1.5.4
setuptools: 2.2
Cython: 0.23.3
numpy: 1.10.4
scipy: 0.16.1
statsmodels: 0.6.1
IPython: 4.0.0
sphinx: 1.3.1
patsy: 0.4.0
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: None
tables: None
numexpr: None
matplotlib: 1.4.3
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: 0.7.3
lxml: None
bs4: 4.4.1
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: 2.6.1 (dt dec pq3 ext lo64)
Jinja2: None
jreback added the Unicode, Duplicate Report, and IO HDF5 labels Feb 12, 2016
Contributor

jreback commented Feb 12, 2016

Duplicate of #11126.

You need to pass the encoding when reading, as it's not currently recorded.
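To illustrate why an unrecorded encoding matters, here is a small sketch in plain Python. It is not pandas code and makes no claim about how read_hdf works internally; it only shows that the same stored bytes decode correctly when the reader is told the encoding and cannot be decoded by a reader that assumes ASCII.

```python
# Bytes as they would be stored for the label 'café' in UTF-8.
raw_label = "café".encode("utf-8")

# With the encoding supplied, the original label comes back.
decoded = raw_label.decode("utf-8")

# A reader that assumes ASCII cannot decode the same bytes.
try:
    raw_label.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

print(decoded, ascii_ok)
```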

jreback closed this as completed Feb 12, 2016
jreback reopened this Feb 12, 2016
jreback removed the Duplicate Report label Feb 12, 2016
jreback changed the title DataFrame.to_hdf will write Unicode labels, but pd.read_hdf won't read them Feb 12, 2016
Author

rspeer commented Feb 12, 2016

My apologies -- this was just because I was reading data saved from 0.17 on a machine that was running 0.16. It seems that 0.17 even does the right thing when I don't specify an encoding at all.

Thanks for your help!

rspeer closed this as completed Feb 12, 2016
Contributor

jreback commented Feb 12, 2016

Great! Yeah, we don't support forward compatibility like that, though for the most part it will generally work (but things did change a bit).
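One way to catch this kind of mismatch early is to compare the version that wrote a file with the version reading it. The sketch below is only an illustration under assumptions: the helper names `parse_version` and `reader_is_older` are hypothetical, the writer's version has to be tracked by the user somehow, and the parser handles only plain dotted integers such as "0.17".

```python
def parse_version(version):
    """Turn a string like '0.17' into a tuple like (0, 17)."""
    return tuple(int(part) for part in version.split("."))


def reader_is_older(writer_version, reader_version):
    """Return True when the reading version predates the writing version."""
    return parse_version(reader_version) < parse_version(writer_version)


# The situation from this thread: written with 0.17, read with 0.16.
mismatch = reader_is_older("0.17", "0.16")
print(mismatch)
```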
