Skip to content

Attempt to fix issue #10366 encoding and categoricals hdf serialization. #10454

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

cottrell
Copy link
Contributor

closes #10366 .

Probably not quite the right approach but want to run travis. Tests are a bit weak at this point (just testing for non-exceptions). Encoding issues might be relevant to other types besides categoricals (but categoricals raise exceptions when strings to mangled to non-uniqueness).

# it seems like for categoricals, when data.dtype is object the data.astype line does
# not raise exception (in python3) and I think this is expected in
# order to trigger the decoding in the except below. TODO review.
if encoding is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use _convert_string_array(data, _ensure_encoding(encoding)) (just to reuse things)

I guess its possible that an min_itemsize was passed for this column as well, so pass that thru too.

@jreback
Copy link
Contributor

jreback commented Jun 27, 2015

Also, let's make a store in 0.16.2. and try it on the current to see if back-compat can be rescued. (but do this after you have it completely working; you might have to try things to 'detect' the encoding, e.g. try to re-encode in 'utf-8' and some such).

@jreback jreback added Bug Unicode Unicode strings IO HDF5 read_hdf, HDFStore labels Jun 27, 2015
@jreback jreback added this to the 0.17.0 milestone Jun 27, 2015
@cottrell cottrell force-pushed the categ_hdf branch 3 times, most recently from 19d7dd0 to 4d6b222 Compare June 28, 2015 13:24
@cottrell
Copy link
Contributor Author

I haven't resolved the python2 issue (TypeError: [unicode] is not implemented as a table column) so have put a skip in the test for PY2.

In Python 3.4.3, have successfully created a store in 0.16.2 and read it using the current branch.

@@ -4405,6 +4406,9 @@ def _unconvert_string_array(data, nan_rep=None, encoding=None):
dtype = "U{0}".format(itemsize)
else:
dtype = "S{0}".format(itemsize)
# fix? issue #10366
data = _convert_string_array(data, _ensure_encoding(encoding),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you think this is necessary here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is from the first outdated diff above. It seemed like the old code was intended to throw an exception and trigger the except section but it was not doing this. As per your comment I replace this with the call to _convert_string_array. I think this just encodes and sets the dtype.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not needed
an exception is thrown on some conversions already this is duplicative

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing this line causes the tests to fail with the usual ValueError: Categorical categories must be unique. _convert_string_array seems to be throwing an error: 'bytes' object has no attribute 'encode'

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are getting these exceptions then something else might be wrong. This should simply be changed to catch a very specific exception. The problem is that this hits a very non-performant patch if it excepts and we want to avoid that by a coding error (rather than an actual exception).

This might be an old bug IIRC. So would appreciate investigation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the problem is that _convert_string_array should return a unicode type (e.g. U) if their is an encoding. Try changing this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had a quick look at this but didn't get see a way of getting around the np.vectorize line. I don't have a lot of time right now to dig into this so maybe there is a simple solution I am missing.

@cottrell
Copy link
Contributor Author

cottrell commented Jul 3, 2015

@jreback I've squashed. Let me know if there are other comments/concerns.


if compat.PY2:
# in Python 2: TypeError: [unicode] is not implemented as a table
# column
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use self.assertRaises (and just put this at the top)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done. Not sure what you mean by "put at the top". assertRaise calls _test which is defined first.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if u out the py2 check at the beginning of the test then u don't need to have define _test then call it
just pull the code up one level

u had it this way before iirc

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change this use assertRaisesRegExp to make sure that the correct exception is raised (iow this matches in the text of the exception as well)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added assertRaisesRegExp and put the PY check above. To do this I had to create an empty contextmanager for the PY3 case (when no exception is expected). Is there a better way?

@cottrell cottrell force-pushed the categ_hdf branch 2 times, most recently from 394fee0 to 631df48 Compare July 4, 2015 10:47
# issue GH10366

if compat.PY2:
# in Python 2: TypeError: [unicode] is not implemented as a table
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you don't need all of this complication

def test_latin_encoding(self):

    if compat.PY2:
        self.assertRaisesRegexp(....)
        return

    values = ......

@jreback
Copy link
Contributor

jreback commented Aug 5, 2015

can you rebase / update?

@cottrell
Copy link
Contributor Author

cottrell commented Aug 7, 2015

Rebased and squashed.

@@ -1,4 +1,5 @@
import nose
from nose.tools import assert_raises
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extra import?

@cottrell cottrell force-pushed the categ_hdf branch 2 times, most recently from f199efc to 7e0d6dd Compare August 8, 2015 18:46
@jreback
Copy link
Contributor

jreback commented Aug 20, 2015

@cottrell can you rebase and update according to my last

@cottrell
Copy link
Contributor Author

@jreback I have rebased and commented above that I did not get very far with the _convert_string_array suggestion you made.

@jreback
Copy link
Contributor

jreback commented Aug 22, 2015

ok will take a look thxs

@jreback
Copy link
Contributor

jreback commented Aug 22, 2015

replaced by #10889

@jreback jreback closed this Aug 22, 2015
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO HDF5 read_hdf, HDFStore Unicode Unicode strings
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Irregular errors when reading certain categorical strings from hdf
2 participants