ValueError when reading a Dataframe with HDFStore in Python 3 from fixed format written in Python 2 #24404

Closed
faulaire opened this issue Dec 23, 2018 · 4 comments · Fixed by #24510
Labels
IO HDF5 read_hdf, HDFStore

@faulaire
Contributor

Code Sample, a copy-pastable example if possible

# Part of the code to be executed in Python 2
import pandas as pd
df = pd.DataFrame([[1, 2, 3, "D"]],
                  columns=['A', 'B', 'C', 'D'],
                  index=pd.Index(['ABC'], name='INDEX_NAME'))

store = pd.HDFStore("test.hdf", mode='w')
store.put(value=df, key="df", format='fixed')
store.close()

# Part of the code to be executed in Python 3
import pandas as pd
store = pd.HDFStore('test.hdf', mode='r')
df = store['df']

Problem description

When I run the above code (the first part in a Python 2 environment and the second part in a Python 3 environment, see details below), I get the following exception:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-7-30b5710694c8> in <module>
----> 1 df = store['df']

~/miniconda3/envs/dev-env-py37/lib/python3.6/site-packages/pandas/io/pytables.py in __getitem__(self, key)
    505 
    506     def __getitem__(self, key):
--> 507         return self.get(key)
    508 
    509     def __setitem__(self, key, value):

~/miniconda3/envs/dev-env-py37/lib/python3.6/site-packages/pandas/io/pytables.py in get(self, key)
    693         if group is None:
    694             raise KeyError('No object named %s in the file' % key)
--> 695         return self._read_group(group)
    696 
    697     def select(self, key, where=None, start=None, stop=None, columns=None,

~/miniconda3/envs/dev-env-py37/lib/python3.6/site-packages/pandas/io/pytables.py in _read_group(self, group, **kwargs)
   1373         s = self._create_storer(group)
   1374         s.infer_axes()
-> 1375         return s.read(**kwargs)
   1376 
   1377 

~/miniconda3/envs/dev-env-py37/lib/python3.6/site-packages/pandas/io/pytables.py in read(self, start, stop, **kwargs)
   2926 
   2927             _start, _stop = (start, stop) if i == select_axis else (None, None)
-> 2928             ax = self.read_index('axis%d' % i, start=_start, stop=_stop)
   2929             axes.append(ax)
   2930 

~/miniconda3/envs/dev-env-py37/lib/python3.6/site-packages/pandas/io/pytables.py in read_index(self, key, **kwargs)
   2521             return self.read_sparse_intindex(key, **kwargs)
   2522         elif variety == u('regular'):
-> 2523             _, index = self.read_index_node(getattr(self.group, key), **kwargs)
   2524             return index
   2525         else:  # pragma: no cover

~/miniconda3/envs/dev-env-py37/lib/python3.6/site-packages/pandas/io/pytables.py in read_index_node(self, node, start, stop)
   2651             index = factory(_unconvert_index(data, kind,
   2652                                              encoding=self.encoding,
-> 2653                                              errors=self.errors), **kwargs)
   2654 
   2655         index.name = name

~/miniconda3/envs/dev-env-py37/lib/python3.6/site-packages/pandas/io/pytables.py in _unconvert_index(data, kind, encoding, errors)
   4561     elif kind in (u('string')):
   4562         index = _unconvert_string_array(data, nan_rep=None, encoding=encoding,
-> 4563                                         errors=errors)
   4564     elif kind == u('object'):
   4565         index = np.asarray(data[0])

~/miniconda3/envs/dev-env-py37/lib/python3.6/site-packages/pandas/io/pytables.py in _unconvert_string_array(data, nan_rep, encoding, errors)
   4654         nan_rep = 'nan'
   4655 
-> 4656     data = libwriters.string_array_replace_from_nan_rep(data, nan_rep)
   4657     return data.reshape(shape)
   4658 

pandas/_libs/writers.pyx in pandas._libs.writers.string_array_replace_from_nan_rep()

ValueError: Buffer dtype mismatch, expected 'Python object' but got 'double'

This exception is raised only with the 'fixed' format. With the 'table' format there is no issue and the DataFrame is read correctly.
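Since the 'table' format is reported to read back correctly, one possible workaround is to write the store with format='table' on the Python 2 side. The following is a minimal sketch, assuming PyTables is installed. The helper names write_as_table, read_frame and demo are hypothetical and not part of pandas. Running both halves on a single interpreter only illustrates the calls; it does not reproduce the cross-version behaviour described above.

```python
import os
import tempfile

import pandas as pd


def write_as_table(df, path, key="df"):
    """Write df to an HDF5 file using the 'table' format instead of 'fixed'."""
    with pd.HDFStore(path, mode="w") as store:
        store.put(key, df, format="table")


def read_frame(path, key="df"):
    """Read the object stored under key back into memory."""
    with pd.HDFStore(path, mode="r") as store:
        return store[key]


def demo():
    """Round-trip the DataFrame from the report through a 'table' store."""
    df = pd.DataFrame([[1, 2, 3, "D"]],
                      columns=["A", "B", "C", "D"],
                      index=pd.Index(["ABC"], name="INDEX_NAME"))
    # Temporary location chosen for illustration only.
    path = os.path.join(tempfile.mkdtemp(), "test.hdf")
    write_as_table(df, path)
    return read_frame(path)
```

Note that the 'table' format is generally slower to write and read than 'fixed', so this trades speed for the portability observed in this report.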

Expected Output

The expected output is the same DataFrame as the one written in Python 2.

Output of pd.show_versions()

[Python 2]

INSTALLED VERSIONS

commit: None
python: 2.7.15.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.59+
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.4
pytest: None
pip: 18.1
setuptools: 40.6.3
Cython: None
numpy: 1.15.4
scipy: None
pyarrow: None
xarray: None
IPython: 5.8.0
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

[Python 3]

INSTALLED VERSIONS

commit: None
python: 3.6.7.final.0
python-bits: 64
OS: Linux
OS-release: 4.4.59+
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.4
pytest: None
pip: 18.1
setuptools: 40.6.3
Cython: None
numpy: 1.15.4
scipy: None
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: None
patsy: None
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@gfyoung gfyoung added 2/3 Compat IO HDF5 read_hdf, HDFStore labels Dec 30, 2018
@gfyoung
Member

gfyoung commented Dec 30, 2018

cc @jreback

@jreback
Contributor

jreback commented Dec 30, 2018

We don't have many guarantees around fixed-format migration from py2 to py3. This would need user investigation and a community PR to fix.

@faulaire
Contributor Author

This pull request partially solves the issue: the DataFrame is now readable, but the name of the index is encoded as 'bytes_' (just like in previous versions of pandas).
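For readers who hit the remaining 'bytes_' index name, a small post-processing step after loading can convert it back to text. This is only a sketch of a user-side workaround, not the change made in the pull request. The helpers decode_if_bytes and fix_index_names are hypothetical, and the UTF-8 encoding is an assumption about how the name was written.

```python
import numpy as np
import pandas as pd


def decode_if_bytes(value, encoding="utf-8"):
    """Return value decoded to str if it is bytes (including numpy.bytes_)."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


def fix_index_names(df, encoding="utf-8"):
    """Return a copy of df whose index names are text rather than bytes."""
    out = df.copy()
    names = [decode_if_bytes(name, encoding) for name in out.index.names]
    # set_names returns a new Index, so the original index is left untouched.
    out.index = out.index.set_names(names)
    return out


# Simulate a frame whose index name came back as numpy.bytes_.
loaded = pd.DataFrame({"A": [1]},
                      index=pd.Index(["ABC"], name=np.bytes_(b"INDEX_NAME")))
fixed = fix_index_names(loaded)
```

The isinstance check covers numpy.bytes_ because it is a subclass of the built-in bytes type. Names that are already text, or None, pass through unchanged.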

@jreback jreback added this to the 0.24.0 milestone Dec 31, 2018
@anuragupadhyaya

This issue still occurs with the groupby operator.
