Attempt to fix issue #10366 encoding and categoricals hdf serialization. #10454

cottrell · 2015-06-27T14:47:44Z

closes #10366.

Probably not quite the right approach, but I want to run Travis. Tests are a bit weak at this point (they only check that no exception is raised). Encoding issues might be relevant to other types besides categoricals (but categoricals raise exceptions when strings are mangled into non-unique values).
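
A minimal sketch of how a lossy encoding step can collapse distinct strings into duplicates, which is the kind of mangling a uniqueness check on categories would reject. This is plain Python for illustration, not the pandas code path, and the sample values are assumptions.

```python
# Illustration only: plain Python, not the pandas/PyTables code path.
# The sample values are hypothetical.
values = ["caf\xe9", "caf\xe8"]  # two distinct strings differing in one accented letter

# A lossy round trip: characters outside ASCII are replaced with "?".
mangled = [v.encode("ascii", errors="replace").decode("ascii") for v in values]

unique_before = len(set(values))
unique_after = len(set(mangled))
print(unique_before, unique_after)
```

After the round trip both entries are the same string, so anything that requires unique categories can no longer accept them.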

jreback · 2015-06-27T14:52:03Z

pandas/io/pytables.py

+            # it seems like for categoricals, when data.dtype is object the data.astype line does
+            # not raise exception (in python3) and I think this is expected in
+            # order to trigger the decoding in the except below. TODO review.
+            if encoding is not None:


use _convert_string_array(data, _ensure_encoding(encoding)) (just to reuse things)

I guess it's possible that a min_itemsize was passed for this column as well, so pass that through too.

jreback · 2015-06-27T14:54:55Z

Also, let's make a store in 0.16.2 and try it on the current branch to see if back-compat can be rescued. (But do this after you have it completely working; you might have to try things to 'detect' the encoding, e.g. try to re-encode in 'utf-8' and so on.)

cottrell · 2015-06-29T09:32:51Z

I haven't resolved the python2 issue (TypeError: [unicode] is not implemented as a table column) so have put a skip in the test for PY2.

In Python 3.4.3, have successfully created a store in 0.16.2 and read it using the current branch.

jreback · 2015-06-29T11:43:57Z

pandas/io/pytables.py

@@ -4405,6 +4406,9 @@ def _unconvert_string_array(data, nan_rep=None, encoding=None):
                dtype = "U{0}".format(itemsize)
            else:
                dtype = "S{0}".format(itemsize)
+            # fix? issue #10366
+            data = _convert_string_array(data, _ensure_encoding(encoding),


why do you think this is necessary here?

This is from the first outdated diff above. It seemed like the old code was intended to throw an exception and trigger the except section, but it was not doing this. As per your comment, I replaced this with the call to _convert_string_array. I think this just encodes and sets the dtype.

This is not needed.
An exception is already thrown on some conversions, so this is duplicative.

Removing this line causes the tests to fail with the usual ValueError: Categorical categories must be unique. _convert_string_array seems to be throwing an error: 'bytes' object has no attribute 'encode'

If you are getting these exceptions then something else might be wrong. This should simply be changed to catch a very specific exception. The problem is that this hits a very non-performant path if it excepts, and we want to avoid taking that path because of a coding error (rather than an actual exception).

This might be an old bug IIRC. So would appreciate investigation.

I think the problem is that _convert_string_array should return a unicode type (e.g. U) if there is an encoding. Try changing this.

I had a quick look at this but didn't see a way of getting around the np.vectorize line. I don't have a lot of time right now to dig into this, so maybe there is a simple solution I am missing.
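
For context, a small NumPy-only sketch of the dtype distinction under discussion: encoding text yields a bytes (`S`) array and decoding yields a unicode (`U`) array. It uses `np.char.encode` and `np.char.decode` rather than pandas' private `_convert_string_array`, whose internals are not shown in this thread, and the sample data is an assumption.

```python
import numpy as np

# Hypothetical sample data, stored as Python str objects.
data = np.array(["caf\xe9", "abc"], dtype=object)

# Encode element-wise: the result is a fixed-width bytes array (kind "S").
encoded = np.char.encode(data.astype("U"), "latin-1")

# Decode element-wise: the result is a fixed-width unicode array (kind "U").
decoded = np.char.decode(encoded, "latin-1")

print(encoded.dtype.kind, decoded.dtype.kind)

# Encoding something that is already bytes fails on Python 3 because bytes
# objects have no .encode method; this is the AttributeError quoted above.
try:
    b"abc".encode("latin-1")
except AttributeError as exc:
    double_encode_error = str(exc)
```

The last step suggests one possible reading of the reported error: data that has already been encoded is being passed to an encoding step a second time. That reading is an inference, not something the thread confirms.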

cottrell · 2015-07-03T12:28:19Z

@jreback I've squashed. Let me know if there are other comments/concerns.

jreback · 2015-07-03T15:30:37Z

pandas/io/tests/test_pytables.py

+
+        if compat.PY2:
+            # in Python 2: TypeError: [unicode] is not implemented as a table
+            # column


use self.assertRaises (and just put this at the top)

Done. Not sure what you mean by "put at the top". assertRaises calls _test, which is defined first.

If you put the PY2 check at the beginning of the test, then you don't need to define _test and then call it;
just pull the code up one level.

You had it this way before, IIRC.

Change this to use assertRaisesRegexp to make sure that the correct exception is raised (in other words, that the text of the exception matches as well).

I've added assertRaisesRegexp and put the PY2 check above. To do this I had to create an empty context manager for the PY3 case (when no exception is expected). Is there a better way?

jreback · 2015-07-05T16:17:56Z

pandas/io/tests/test_pytables.py

+        # issue GH10366
+
+        if compat.PY2:
+            # in Python 2: TypeError: [unicode] is not implemented as a table


You don't need all of this complication:

    def test_latin_encoding(self):
        if compat.PY2:
            self.assertRaisesRegexp(....)
            return
        values = ......
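
A self-contained sketch of the early-return test layout suggested above, written against the standard `unittest` module. The `write_unicode_column` helper and the sample values are hypothetical stand-ins for the real HDF store round trip, and `PY2` is computed locally instead of coming from `pandas.compat`. The sketch uses `assertRaisesRegex`, the Python 3 spelling of `assertRaisesRegexp`.

```python
import sys
import unittest

PY2 = sys.version_info[0] == 2  # local stand-in for compat.PY2


def write_unicode_column(values):
    """Hypothetical stand-in for writing values to a store and reading them back."""
    if PY2:
        raise TypeError("[unicode] is not implemented as a table column")
    return list(values)


class TestLatinEncoding(unittest.TestCase):
    def test_latin_encoding(self):
        values = ["E\xe9", "abc"]  # hypothetical sample data

        if PY2:
            # Check the exception type and its message, then stop early.
            with self.assertRaisesRegex(
                TypeError, r"\[unicode\] is not implemented as a table column"
            ):
                write_unicode_column(values)
            return

        # Python 3 path: the round trip should preserve the values.
        self.assertEqual(write_unicode_column(values), values)
```

With the version check and early return at the top, no nested `_test` function or empty context manager is needed.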

jreback · 2015-08-05T21:37:24Z

can you rebase / update?

cottrell · 2015-08-07T19:43:26Z

Rebased and squashed.

jreback · 2015-08-08T00:22:36Z

pandas/io/tests/test_pytables.py

@@ -1,4 +1,5 @@
 import nose
+from nose.tools import assert_raises


extra import?

jreback · 2015-08-20T13:19:51Z

@cottrell can you rebase and update according to my last

cottrell · 2015-08-22T18:13:34Z

@jreback I have rebased and commented above that I did not get very far with the _convert_string_array suggestion you made.

jreback · 2015-08-22T18:27:23Z

ok will take a look thxs

jreback · 2015-08-22T20:21:50Z

replaced by #10889

jreback reviewed Jun 27, 2015

jreback added the Bug, Unicode, and IO HDF5 labels Jun 27, 2015

jreback added this to the 0.17.0 milestone Jun 27, 2015

cottrell force-pushed the categ_hdf branch 3 times, most recently from 19d7dd0 to 4d6b222 Compare June 28, 2015 13:24

jreback reviewed Jun 29, 2015

cottrell force-pushed the categ_hdf branch from 7f7cff9 to fbdb05c Compare July 2, 2015 22:10

jreback reviewed Jul 3, 2015

cottrell force-pushed the categ_hdf branch 2 times, most recently from 394fee0 to 631df48 Compare July 4, 2015 10:47

jreback reviewed Jul 5, 2015

cottrell force-pushed the categ_hdf branch from bfd6bff to 0151c76 Compare August 6, 2015 21:22

jreback reviewed Aug 8, 2015

cottrell force-pushed the categ_hdf branch 2 times, most recently from f199efc to 7e0d6dd Compare August 8, 2015 18:46

Add tests and fix issue pandas-dev#10366 encoding and categoricals hdf serialization. (8463c63)

cottrell force-pushed the categ_hdf branch from 7e0d6dd to 8463c63 Compare August 22, 2015 18:11

jreback mentioned this pull request Aug 22, 2015

BUG: encoding of categoricals in hdf serialization #10889

Merged

jreback closed this Aug 22, 2015
