HDFStore fails to read non-ascii characters #11234

Closed
FilipDusek opened this issue Oct 4, 2015 · 7 comments
Labels
Bug; IO HDF5 (read_hdf, HDFStore); Unicode (Unicode strings)
Milestone

Comments

@FilipDusek

When I try to save a non-ASCII character such as é and then load it again, I get a UnicodeDecodeError. If the string contains more data (for example 'aée'), it is stored and retrieved without error, but the result is missing the last character.

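import pandas as pd

# Build a one-row frame holding a single non-ASCII character.
df = pd.DataFrame(columns=["A"])
toAppend = {"A": "é"}
df = df.append(toAppend, ignore_index=True)

# Write in table format, then read the same key back.
store = pd.HDFStore(r'thiswillcrash.h5')
store.put('df', df, format='table', encoding="utf-8")
d = store["df"]  # raises UnicodeDecodeError
print(d)

store.close()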

Versions

INSTALLED VERSIONS
------------------
commit: None
python: 3.4.3.final.0
python-bits: 64
OS: Windows
OS-release: 8
machine: AMD64
processor: Intel64 Family 6 Model 69 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.16.2
nose: 1.3.4
Cython: 0.22
numpy: 1.9.3
scipy: 0.15.1
statsmodels: 0.6.1
IPython: 3.0.0
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.4.2
pytz: 2015.6
bottleneck: None
tables: 3.2.2
numexpr: 2.4.4
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: None
xlsxwriter: 0.6.7
lxml: 3.4.4
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 0.9.9
pymysql: None
psycopg2: None
@jreback
Contributor

jreback commented Oct 4, 2015

Should be fixed by #10889.

Give it a try with v0.17.0rc2:

conda install pandas -c pandas

jreback added the IO HDF5 (read_hdf, HDFStore) label Oct 4, 2015
@FilipDusek
Author

No, unfortunately I still get the error

---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-3-3e6096eba1ca> in <module>()
      8 store = pd.HDFStore(r'iwillcrash30.h5')
      9 store.put('df', df, format='table', encoding="utf-8")
---> 10 d = store["df"]
     11 print(d)
     12 

C:\Users\Filip\Anaconda3\lib\site-packages\pandas\io\pytables.py in __getitem__(self, key)
    424 
    425     def __getitem__(self, key):
--> 426         return self.get(key)
    427 
    428     def __setitem__(self, key, value):

C:\Users\Filip\Anaconda3\lib\site-packages\pandas\io\pytables.py in get(self, key)
    634         if group is None:
    635             raise KeyError('No object named %s in the file' % key)
--> 636         return self._read_group(group)
    637 
    638     def select(self, key, where=None, start=None, stop=None, columns=None,

C:\Users\Filip\Anaconda3\lib\site-packages\pandas\io\pytables.py in _read_group(self, group, **kwargs)
   1271         s = self._create_storer(group)
   1272         s.infer_axes()
-> 1273         return s.read(**kwargs)
   1274 
   1275 

C:\Users\Filip\Anaconda3\lib\site-packages\pandas\io\pytables.py in read(self, where, columns, **kwargs)
   4004     def read(self, where=None, columns=None, **kwargs):
   4005 
-> 4006         if not self.read_axes(where=where, **kwargs):
   4007             return None
   4008 

C:\Users\Filip\Anaconda3\lib\site-packages\pandas\io\pytables.py in read_axes(self, where, **kwargs)
   3216         for a in self.axes:
   3217             a.set_info(self.info)
-> 3218             a.convert(values, nan_rep=self.nan_rep, encoding=self.encoding)
   3219 
   3220         return True

C:\Users\Filip\Anaconda3\lib\site-packages\pandas\io\pytables.py in convert(self, values, nan_rep, encoding)
   2062         if _ensure_decoded(self.kind) == u('string'):
   2063             self.data = _unconvert_string_array(
-> 2064                 self.data, nan_rep=nan_rep, encoding=encoding)
   2065 
   2066         return self

C:\Users\Filip\Anaconda3\lib\site-packages\pandas\io\pytables.py in _unconvert_string_array(data, nan_rep, encoding)
   4430 
   4431         if isinstance(data[0], compat.binary_type):
-> 4432             data = Series(data).str.decode(encoding).values
   4433         else:
   4434             data = data.astype(dtype, copy=False).astype(object, copy=False)

C:\Users\Filip\Anaconda3\lib\site-packages\pandas\core\strings.py in decode(self, encoding, errors)
   1310     @copy(str_decode)
   1311     def decode(self, encoding, errors="strict"):
-> 1312         result = str_decode(self.series, encoding, errors)
   1313         return self._wrap_result(result)
   1314 

C:\Users\Filip\Anaconda3\lib\site-packages\pandas\core\strings.py in str_decode(arr, encoding, errors)
    979     """
    980     f = lambda x: x.decode(encoding, errors)
--> 981     return _na_map(f, arr)
    982 
    983 

C:\Users\Filip\Anaconda3\lib\site-packages\pandas\core\strings.py in _na_map(f, arr, na_result, dtype)
    119 def _na_map(f, arr, na_result=np.nan, dtype=object):
    120     # should really _check_ for NA
--> 121     return _map(f, arr, na_mask=True, na_value=na_result, dtype=dtype)
    122 
    123 

C:\Users\Filip\Anaconda3\lib\site-packages\pandas\core\strings.py in _map(f, arr, na_mask, na_value, dtype)
    135         mask = isnull(arr)
    136         try:
--> 137             result = lib.map_infer_mask(arr, f, mask.view(np.uint8))
    138         except (TypeError, AttributeError):
    139             def g(x):

pandas\src\inference.pyx in pandas.lib.map_infer_mask (pandas\lib.c:61753)()

C:\Users\Filip\Anaconda3\lib\site-packages\pandas\core\strings.py in <lambda>(x)
    978     decoded : Series/Index of objects
    979     """
--> 980     f = lambda x: x.decode(encoding, errors)
    981     return _na_map(f, arr)
    982 

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 0: unexpected end of data

Versions

INSTALLED VERSIONS
------------------
commit: None
python: 3.4.3.final.0
python-bits: 64
OS: Windows
OS-release: 8
machine: AMD64
processor: Intel64 Family 6 Model 69 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None

pandas: 0.17.0rc2
nose: 1.3.7
pip: 7.1.2
setuptools: 18.3.2
Cython: 0.22.1
numpy: 1.9.3
scipy: 0.15.1
statsmodels: 0.6.1
IPython: 3.2.0
sphinx: 1.3.1
patsy: 0.3.0
dateutil: 2.4.2
pytz: 2015.6
blosc: None
bottleneck: 1.0.0
tables: 3.2.0
numexpr: 2.4.3
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 1.0.0
xlsxwriter: 0.7.3
lxml: 3.4.4
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.5
pymysql: None
psycopg2: None

@TomAugspurger
Contributor

@jreback looks like we truncate the column to be length 1 since len(df.iloc[0, 0]) is 1.

This works though

In [19]: df = pd.DataFrame({'A': ['é']})

In [20]: store = pd.HDFStore(r'thiswillcrash.h5')

In [21]: store.put('df', df, format='table', min_itemsize={'A': 30})

In [22]: store.get('df')
Out[22]:
   A
0  é

Do you have a good idea where a fix would go?

@jreback
Contributor

jreback commented Oct 4, 2015

https://github.com/pydata/pandas/blob/master/pandas/lib.pyx#L972

is where the width of the strings is determined,
but it should work for unicode

@TomAugspurger
Contributor

Is it because the encoded length is different than the number of characters?

In [10]: x
Out[10]: 'é'

In [11]: len(x)
Out[11]: 1

In [12]: len(x.encode('utf-8'))
Out[12]: 2
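
To make the mismatch concrete, here is a minimal sketch in plain Python that mimics a fixed-width byte column by slicing the encoded bytes. The helper name `store_fixed_width` is made up for illustration and is not part of pandas or PyTables; it only simulates what happens when the column width is taken from the character count.

```python
def store_fixed_width(text, width, encoding="utf-8"):
    """Hypothetical helper: encode text and cut it to `width` bytes,
    roughly as a fixed-width byte column would."""
    return text.encode(encoding)[:width]


# Width taken from len(text), i.e. the number of characters.
single = store_fixed_width("é", width=len("é"))
print(single)  # only the first byte of the two-byte sequence is kept

try:
    single.decode("utf-8")
except UnicodeDecodeError as exc:
    print("decode failed:", exc.reason)

# A longer string still decodes, but loses its last character.
longer = store_fixed_width("aée", width=len("aée"))
print(longer.decode("utf-8"))
```

The second case lines up with the 'aée' report: three characters need four bytes in UTF-8, so a three-byte slot drops the final e.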

@jreback
Contributor

jreback commented Oct 4, 2015

yep should encode before we check and set the length
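
A rough sketch of that idea, not the actual change made in pandas: measure each string after encoding it and use the maximum as the column width in bytes. `encoded_itemsize` is a hypothetical name used only here, and the fixed-width column is again imitated with a byte slice.

```python
def encoded_itemsize(values, encoding="utf-8"):
    """Hypothetical helper: widest value in bytes once encoded."""
    return max((len(v.encode(encoding)) for v in values), default=0)


values = ["é", "aée", "abc"]
char_width = max(len(v) for v in values)  # counts characters
byte_width = encoded_itemsize(values)     # counts encoded bytes
print(char_width, byte_width)

# With the byte width, every value survives a fixed-width round trip.
for v in values:
    stored = v.encode("utf-8")[:byte_width]
    assert stored.decode("utf-8") == v
```

On affected versions, a byte width computed this way could presumably also be passed as `min_itemsize` in the `store.put` workaround shown earlier in the thread, instead of a guessed value such as 30.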

jreback added the Bug and Unicode (Unicode strings) labels Oct 5, 2015
jreback added this to the 0.17.1 milestone Oct 5, 2015
jreback pushed a commit that referenced this issue Oct 9, 2015
Failure came when the maximum length of the unencoded string
was smaller than the maximum encoded length.
@jreback
Contributor

jreback commented Oct 9, 2015

closed by #11240

jreback closed this as completed Oct 9, 2015
yarikoptic added a commit to neurodebian/pandas that referenced this issue Oct 11, 2015
* commit 'v0.17.0-8-gcac4ad2': (57 commits)
  BUG: to_excel duplicate columns
  BUG: HDFStore.append with encoded string itemsize, pandas-dev#11234
  BUG: remove midrule in latex output with header=False
  BUG: squeeze works on 0 length arrays, pandas-dev#11299, pandas-dev#8999
  DOC: add whatsnew 0.17.1 to index
  DOC: update resample docs
  timeseries: add tip about using groupby() rather than resample
  DOC: release_stats.sh script to report release stats
  DOC: edit release.rst
  CI: fix numpy to 1.9.3 in 2.7,3.5 builds for now, as packages for 1.10.0 not released ATM
  DOC: Included halflife as one 3 optional params that must be specified
  DOC: whatsnew 0.17.0 edits
  BUG/ERR: raise when trying to set a subset of values in a datetime64[ns, tz] column with another tz
  DOC: Add note about unicode layout
  DOC: hack to numpydoc to include attributes that are None (GH6100)
  DOC: add str accessor docstring pages to api.rst to avoid warning
  DOC: hack to numpydoc to avoid warnings for Categorical (not including members)
  skip some plotting tests if scipy is not installed
  add matplotlib to ci for 3.5
  COMPAT/PERF: lib.ismember_int64 on older numpies/cython not comparing correctly PERF: use np.in1d on larger isin sizes
  ...
3 participants