HDFStore fails to read non-ascii characters #11234
Comments
Should be fixed by #10889; give it a try with
|
No, unfortunately I still get the error
Versions INSTALLED VERSIONS
------------------
commit: None
python: 3.4.3.final.0
python-bits: 64
OS: Windows
OS-release: 8
machine: AMD64
processor: Intel64 Family 6 Model 69 Stepping 1, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
pandas: 0.17.0rc2
nose: 1.3.7
pip: 7.1.2
setuptools: 18.3.2
Cython: 0.22.1
numpy: 1.9.3
scipy: 0.15.1
statsmodels: 0.6.1
IPython: 3.2.0
sphinx: 1.3.1
patsy: 0.3.0
dateutil: 2.4.2
pytz: 2015.6
blosc: None
bottleneck: 1.0.0
tables: 3.2.0
numexpr: 2.4.3
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 1.0.0
xlsxwriter: 0.7.3
lxml: 3.4.4
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: 1.0.5
pymysql: None
psycopg2: None |
@jreback looks like we truncate the column to be length 1 since
This works though
In [19]: df = pd.DataFrame({'A': ['é']})
In [20]: store = pd.HDFStore(r'thiswillcrash.h5')
In [21]: store.put('df', df, format='table', min_itemsize={'A': 30})
In [22]: store.get('df')
Out[22]:
   A
0  é
Do you have a good idea where a fix would go? |
https://github.com/pydata/pandas/blob/master/pandas/lib.pyx#L972 is where the width of the strings is determined |
Is it because the encoded length is different than the number of characters?
In [10]: x
Out[10]: 'é'
In [11]: len(x)
Out[11]: 1
In [12]: len(x.encode('utf-8'))
Out[12]: 2 |
yep should encode before we check and set the length |
The failure occurred when the maximum length of the unencoded strings was smaller than the maximum encoded length.
closed by #11240 |
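To illustrate the diagnosis above, here is a minimal pure-Python sketch of the two ways a column width could be measured. The helper names `char_width` and `encoded_width` are hypothetical and are not part of pandas or PyTables; the actual change is the one in #11240.

```python
def char_width(values):
    """Width measured in characters (the measure that undercounts here)."""
    return max((len(v) for v in values), default=0)


def encoded_width(values, encoding="utf-8"):
    """Width measured in bytes after encoding, which is what a
    fixed-width byte column needs in order to hold every value."""
    return max((len(v.encode(encoding)) for v in values), default=0)


column = ["é", "aée"]
print(char_width(column))     # character count of the longest value
print(encoded_width(column))  # byte count of the longest encoded value
```

Whenever a value contains a character that takes more than one byte in the chosen encoding, the second measure is larger than the first, and sizing the column by the first leaves too little room.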
* commit 'v0.17.0-8-gcac4ad2': (57 commits)
  BUG: to_excel duplicate columns
  BUG: HDFStore.append with encoded string itemsize, pandas-dev#11234
  BUG: remove midrule in latex output with header=False
  BUG: squeeze works on 0 length arrays, pandas-dev#11299, pandas-dev#8999
  DOC: add whatsnew 0.17.1 to index
  DOC: update resample docs
  timeseries: add tip about using groupby() rather than resample
  DOC: release_stats.sh script to report release stats
  DOC: edit release.rst
  CI: fix numpy to 1.9.3 in 2.7,3.5 builds for now, as packages for 1.10.0 not released ATM
  DOC: Included halflife as one 3 optional params that must be specified
  DOC: whatsnew 0.17.0 edits
  BUG/ERR: raise when trying to set a subset of values in a datetime64[ns, tz] column with another tz
  DOC: Add note about unicode layout
  DOC: hack to numpydoc to include attributes that are None (GH6100)
  DOC: add str accessor docstring pages to api.rst to avoid warning
  DOC: hack to numpydoc to avoid warnings for Categorical (not including members)
  skip some plotting tests if scipy is not installed
  add matplotlib to ci for 3.5
  COMPAT/PERF: lib.ismember_int64 on older numpies/cython not comparing correctly
  PERF: use np.in1d on larger isin sizes
  ...
When I try to save a non-ASCII character like é and then load it again, I end up with a UnicodeDecodeError. If you add more data to the string (like 'aée'), the data is stored and retrieved without error, but the result is missing the last character.
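Both symptoms can be reproduced without pandas by simulating a fixed-width byte column. This is only a sketch under the assumption that the stored bytes are cut to a width equal to the number of characters; `roundtrip_fixed_width` is a hypothetical helper, not a pandas or PyTables function.

```python
def roundtrip_fixed_width(text, itemsize, encoding="utf-8"):
    """Encode text, keep only the first `itemsize` bytes (as a
    fixed-width column would), then decode what is left."""
    stored = text.encode(encoding)[:itemsize]
    return stored.decode(encoding)


# 'aée' is 3 characters but 4 bytes in UTF-8, so a width of 3 drops the final 'e'.
print(roundtrip_fixed_width("aée", len("aée")))

# 'é' is 1 character but 2 bytes, so a width of 1 splits the character in half.
try:
    roundtrip_fixed_width("é", len("é"))
except UnicodeDecodeError as exc:
    print("decode failed:", exc.reason)
```

In the first case the cut lands on a character boundary, so decoding succeeds with a character missing; in the second it lands inside a multi-byte sequence, so decoding fails.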