-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Attempt to fix issue #10366 encoding and categoricals hdf serialization. #10454
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
# it seems like for categoricals, when data.dtype is object the data.astype line does | ||
# not raise exception (in python3) and I think this is expected in | ||
# order to trigger the decoding in the except below. TODO review. | ||
if encoding is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use _convert_string_array(data, _ensure_encoding(encoding))
(just to reuse things)
I guess its possible that an min_itemsize
was passed for this column as well, so pass that thru too.
Also, let's make a store in 0.16.2. and try it on the current to see if back-compat can be rescued. (but do this after you have it completely working; you might have to try things to 'detect' the encoding, e.g. try to re-encode in 'utf-8' and some such). |
19d7dd0
to
4d6b222
Compare
I haven't resolved the python2 issue (TypeError: [unicode] is not implemented as a table column) so have put a skip in the test for PY2. In Python 3.4.3, have successfully created a store in 0.16.2 and read it using the current branch. |
@@ -4405,6 +4406,9 @@ def _unconvert_string_array(data, nan_rep=None, encoding=None): | |||
dtype = "U{0}".format(itemsize) | |||
else: | |||
dtype = "S{0}".format(itemsize) | |||
# fix? issue #10366 | |||
data = _convert_string_array(data, _ensure_encoding(encoding), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do you think this is necessary here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is from the first outdated diff above. It seemed like the old code was intended to throw an exception and trigger the except section but it was not doing this. As per your comment I replace this with the call to _convert_string_array. I think this just encodes and sets the dtype.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not needed
an exception is thrown on some conversions already this is duplicative
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removing this line causes the tests to fail with the usual ValueError: Categorical categories must be unique. _convert_string_array seems to be throwing an error: 'bytes' object has no attribute 'encode'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you are getting these exceptions then something else might be wrong. This should simply be changed to catch a very specific exception. The problem is that this hits a very non-performant patch if it excepts and we want to avoid that by a coding error (rather than an actual exception).
This might be an old bug IIRC. So would appreciate investigation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the problem is that _convert_string_array
should return a unicode type (e.g. U
) if their is an encoding. Try changing this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had a quick look at this but didn't get see a way of getting around the np.vectorize line. I don't have a lot of time right now to dig into this so maybe there is a simple solution I am missing.
@jreback I've squashed. Let me know if there are other comments/concerns. |
|
||
if compat.PY2: | ||
# in Python 2: TypeError: [unicode] is not implemented as a table | ||
# column |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use self.assertRaises (and just put this at the top)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done. Not sure what you mean by "put at the top". assertRaise calls _test which is defined first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if u out the py2 check at the beginning of the test then u don't need to have define _test then call it
just pull the code up one level
u had it this way before iirc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change this use assertRaisesRegExp to make sure that the correct exception is raised (iow this matches in the text of the exception as well)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've added assertRaisesRegExp and put the PY check above. To do this I had to create an empty contextmanager for the PY3 case (when no exception is expected). Is there a better way?
394fee0
to
631df48
Compare
# issue GH10366 | ||
|
||
if compat.PY2: | ||
# in Python 2: TypeError: [unicode] is not implemented as a table |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don't need all of this complication
def test_latin_encoding(self):
if compat.PY2:
self.assertRaisesRegexp(....)
return
values = ......
can you rebase / update? |
Rebased and squashed. |
@@ -1,4 +1,5 @@ | |||
import nose | |||
from nose.tools import assert_raises |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extra import?
f199efc
to
7e0d6dd
Compare
@cottrell can you rebase and update according to my last |
…f serialization.
@jreback I have rebased and commented above that I did not get very far with the _convert_string_array suggestion you made. |
ok will take a look thxs |
replaced by #10889 |
closes #10366 .
Probably not quite the right approach but want to run travis. Tests are a bit weak at this point (just testing for non-exceptions). Encoding issues might be relevant to other types besides categoricals (but categoricals raise exceptions when strings to mangled to non-uniqueness).