Attempt to fix issue #10366 encoding and categoricals hdf serialization. #10454

Closed · wants to merge 1 commit
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.17.0.txt
@@ -765,3 +765,4 @@ Bug Fixes
 - Bug in ``read_msgpack`` where encoding is not respected (:issue:`10580`)
 - Bug preventing access to the first index when using ``iloc`` with a list containing the appropriate negative integer (:issue:`10547`, :issue:`10779`)
 - Bug in ``TimedeltaIndex`` formatter causing error while trying to save ``DataFrame`` with ``TimedeltaIndex`` using ``to_csv`` (:issue:`10833`)
+- Bug in ``Categorical`` HDF serialization in the presence of alternate encodings (:issue:`10366`)
6 changes: 5 additions & 1 deletion pandas/io/pytables.py
@@ -3049,7 +3049,8 @@ def write_metadata(self, key, values):

         """
         values = Series(values)
-        self.parent.put(self._get_metadata_path(key), values, format='table')
+        self.parent.put(self._get_metadata_path(key), values, format='table',
+                        encoding=self.encoding, nan_rep=self.nan_rep)
 
     def read_metadata(self, key):
         """ return the meta data array for this key """
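The one-line change above matters because write_metadata round-trips the Categorical's category labels through a separate metadata table; if the labels are serialized with a different encoding than the values, the decoded categories no longer line up with the stored codes. A loose, pandas-free sketch of that failure mode (the byte strings and the utf-8 stand-in here are illustrative, not the actual pytables code path):

```python
# Category labels as the user supplied them (latin-1 representable).
categories = ["E\xc9, 17", "a", "b"]

# Values table: written with the encoding the caller asked for.
stored_values = [c.encode("latin-1") for c in categories]

# Before the fix, the metadata table was written via parent.put(...)
# WITHOUT encoding=self.encoding, so the labels could end up in a
# different byte form (shown here as utf-8 purely for illustration).
stored_meta = [c.encode("utf-8") for c in categories]

# On read, both tables are decoded with the requested encoding; the
# mismatched metadata no longer matches the values' categories.
decoded_values = [v.decode("latin-1") for v in stored_values]
decoded_meta = [m.decode("latin-1") for m in stored_meta]
assert decoded_values == categories   # values survive intact
assert decoded_values != decoded_meta  # categories diverge
```

Passing `encoding=self.encoding` (and `nan_rep`) makes both tables go through the same serialization path, so the decoded labels match again.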
@@ -4428,6 +4429,9 @@ def _unconvert_string_array(data, nan_rep=None, encoding=None):
             dtype = "U{0}".format(itemsize)
         else:
             dtype = "S{0}".format(itemsize)
+        # fix? issue #10366
+        data = _convert_string_array(data, _ensure_encoding(encoding),
Contributor: why do you think this is necessary here?

Contributor (author): This is from the first outdated diff above. It seemed like the old code was intended to throw an exception and trigger the except section, but it was not doing so. As per your comment, I replaced this with the call to _convert_string_array. I think this just encodes and sets the dtype.

Contributor: This is not needed; an exception is already thrown on some conversions, so this is duplicative.

Contributor (author): Removing this line causes the tests to fail with the usual ValueError: Categorical categories must be unique. _convert_string_array seems to be throwing an error: 'bytes' object has no attribute 'encode'.
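The 'bytes' object has no attribute 'encode' error quoted above is a Python 3 symptom: bytes only has decode(), and str only has encode(), so a function that encodes (like _convert_string_array) fails when handed data that is already bytes. A minimal standalone illustration, not pandas code:

```python
b = b"caf\xe9"                   # already-encoded latin-1 bytes
s = b.decode("latin-1")          # bytes -> str works
assert s.encode("latin-1") == b  # str -> bytes works

# In Python 3, bytes has no encode() at all; attempting it raises the
# AttributeError quoted in the comment above.
assert not hasattr(b, "encode")
```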

Contributor: If you are getting these exceptions then something else might be wrong. This should simply be changed to catch a very specific exception. The problem is that this hits a very non-performant path if it excepts, and we want to avoid entering that path through a coding error (rather than an actual exception).

This might be an old bug, IIRC, so I would appreciate investigation.

Contributor: I think the problem is that _convert_string_array should return a unicode type (e.g. U) if there is an encoding. Try changing this.

Contributor (author): I had a quick look at this but didn't see a way of getting around the np.vectorize line. I don't have a lot of time right now to dig into this, so maybe there is a simple solution I am missing.

+                                     itemsize=itemsize)
         data = data.astype(dtype, copy=False).astype(object, copy=False)
     except (Exception) as e:
         f = np.vectorize(lambda x: x.decode(encoding), otypes=[np.object])
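For context, the except branch above is the "very non-performant path" the reviewer refers to: it decodes the array element by element through np.vectorize instead of using a bulk dtype cast. A standalone sketch of that fallback (the original spells the dtype np.object; modern NumPy uses plain object):

```python
import numpy as np

# Fixed-width bytes array, as read back from a pytables string column.
data = np.array([b"E\xc9, 17", b"\xf8\xfc", b"a"])  # dtype "S7"

# Element-wise decode, mirroring the except-branch fallback above.
f = np.vectorize(lambda x: x.decode("latin-1"), otypes=[object])
decoded = f(data)

assert decoded.dtype == object
assert decoded.tolist() == ["E\xc9, 17", "\xf8\xfc", "a"]
```

Because np.vectorize is a Python-level loop, this costs a function call per element, which is why the reviewer wants the fast bulk path to fail only on a genuine exception.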
45 changes: 45 additions & 0 deletions pandas/io/tests/test_pytables.py
@@ -930,6 +930,51 @@ def test_encoding(self):
            result = store.select('df',Term('columns=A',encoding='ascii'))
            tm.assert_frame_equal(result,expected)

    def test_latin_encoding(self):

        if compat.PY2:
            self.assertRaisesRegexp(
                TypeError, '\[unicode\] is not implemented as a table column')
            return

        values = [[b'E\xc9, 17', b'', b'a', b'b', b'c'],
                  [b'E\xc9, 17', b'a', b'b', b'c'],
                  [b'EE, 17', b'', b'a', b'b', b'c'],
                  [b'E\xc9, 17', b'\xf8\xfc', b'a', b'b', b'c'],
                  [b'', b'a', b'b', b'c'],
                  [b'\xf8\xfc', b'a', b'b', b'c'],
                  [b'A\xf8\xfc', b'', b'a', b'b', b'c'],
                  [np.nan, b'', b'b', b'c'],
                  [b'A\xf8\xfc', np.nan, b'', b'b', b'c']]

        def _try_decode(x, encoding='latin-1'):
            try:
                return x.decode(encoding)
            except AttributeError:
                return x

        # not sure how to remove latin-1 from code in python 2 and 3
        values = [[_try_decode(x) for x in y] for y in values]

        examples = []
        for dtype in ['category', object]:
            for val in values:
                examples.append(pandas.Series(val, dtype=dtype))

        def roundtrip(s, key='data', encoding='latin-1', nan_rep=''):
Contributor: use the ensure_clean_path routines, don't reinvent the wheel

Contributor (author): Ok, I've used the self.path method here. No more mktemp.

            with ensure_clean_path(self.path) as store:
                s.to_hdf(store, key, format='table', encoding=encoding,
                         nan_rep=nan_rep)
                retr = read_hdf(store, key)
                s_nan = s.replace(nan_rep, np.nan)
                assert_series_equal(s_nan, retr)

        for s in examples:
            roundtrip(s)

        # fails:
        # for s in examples:
        #     roundtrip(s, nan_rep=b'\xf8\xfc')
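A note on why the test compares against s.replace(nan_rep, np.nan) rather than s itself: to_hdf stores NaN as the nan_rep sentinel string, so after a round trip the sentinel and a genuine NaN are indistinguishable. A pandas-only sketch of that collapsing step (no HDF5 file involved), assuming the test's default sentinel '':

```python
import numpy as np
import pandas as pd

s = pd.Series(["A\xf8\xfc", np.nan, "", "b"])

# With nan_rep='', NaN is written to disk as '' and read back as NaN,
# so '' and NaN collapse into a single value after the round trip; the
# expected Series mirrors that by applying the same collapse up front.
expected = s.replace("", np.nan)
assert expected.isna().tolist() == [False, True, True, False]
```

This also hints at why the commented-out bytes nan_rep case fails: the sentinel must survive the same encode/decode path as the data for the replacement to line up.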


    def test_append_some_nans(self):

        with ensure_clean_store(self.path) as store: