fix hashing string-casting error #21187

Merged: 10 commits merged on Jun 21, 2018
1 change: 1 addition & 0 deletions doc/source/whatsnew/v0.23.1.txt
@@ -100,6 +100,7 @@ Bug Fixes
 - Bug in :meth:`Series.str.replace()` where the method throws `TypeError` on Python 3.5.2 (:issue: `21078`)
 - Bug in :class:`Timedelta`: where passing a float with a unit would prematurely round the float precision (:issue: `14156`)
 - Bug in :func:`pandas.testing.assert_index_equal` which raised ``AssertionError`` incorrectly, when comparing two :class:`CategoricalIndex` objects with param ``check_categorical=False`` (:issue:`19776`)
+- Bug in rendering :class:`Series` with ``Categorical`` dtype in rare conditions under Python 2.7 (:issue:`21002`)
Review comment (Contributor):

can you move to 0.23.2


**Sparse**

7 changes: 2 additions & 5 deletions pandas/_libs/hashing.pyx
@@ -8,8 +8,7 @@ import numpy as np
 from numpy cimport ndarray, uint8_t, uint32_t, uint64_t

 from util cimport _checknull
-from cpython cimport (PyString_Check,
-                      PyBytes_Check,
+from cpython cimport (PyBytes_Check,
                       PyUnicode_Check)
 from libc.stdlib cimport malloc, free

@@ -62,9 +61,7 @@ def hash_object_array(ndarray[object] arr, object key, object encoding='utf8'):
     cdef list datas = []
     for i in range(n):
         val = arr[i]
-        if PyString_Check(val):
-            data = <bytes>val.encode(encoding)
-        elif PyBytes_Check(val):
+        if PyBytes_Check(val):
             data = <bytes>val
         elif PyUnicode_Check(val):
             data = <bytes>val.encode(encoding)
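For readers unfamiliar with why the `PyString_Check` branch was a problem: on Python 2, `str` is a byte string, so calling `.encode(encoding)` on it first decodes implicitly with the default (ASCII) codec, which can raise `UnicodeDecodeError` for non-ASCII bytes. Below is a rough pure-Python sketch of the branch order after this change. It covers only the two branches visible in the hunk; the function name `to_hash_bytes` and the `TypeError` fallback are illustrative assumptions, not part of pandas.

```python
# Hypothetical pure-Python analogue of the per-element branching in the hunk above.
text_type = type(u"")  # `unicode` on Python 2, `str` on Python 3


def to_hash_bytes(val, encoding="utf8"):
    """Return the bytes that would be hashed for a single element."""
    if isinstance(val, bytes):
        # Already bytes: pass through untouched, never call .encode() on it.
        return val
    elif isinstance(val, text_type):
        # Text: encode explicitly with the requested encoding.
        return val.encode(encoding)
    # Assumption: the real function handles further cases not shown in the diff.
    raise TypeError("unsupported type: {}".format(type(val).__name__))
```

Checking for bytes first means a non-ASCII byte string is never routed through an implicit ASCII decode.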
36 changes: 36 additions & 0 deletions pandas/tests/series/test_repr.py
@@ -202,6 +202,42 @@ def test_latex_repr(self):

 class TestCategoricalRepr(object):

+    def test_categorical_repr_unicode(self):
+        # GH#21002 if len(index) > 60, sys.getdefaultencoding()=='ascii',
+        # and we are working in PY2, then rendering a Categorical could raise
+        # UnicodeDecodeError by trying to decode when it shouldn't
+        from pandas.core.base import StringMixin
Review comment (Contributor):

can import at the top


+        class County(StringMixin):
+            name = u'San Sebastián'
+            state = u'PR'

+            def __unicode__(self):
+                return self.name + u', ' + self.state

+        cat = pd.Categorical([County() for n in range(61)])
+        idx = pd.Index(cat)
+        ser = idx.to_series()

+        if compat.PY3:
+            # no reloading of sys, just check that the default (utf8) works
+            # as expected
+            repr(ser)
+            str(ser)

+        else:
+            # set sys.defaultencoding to ascii, then change it back after
Review comment (Contributor):

can you make this into a context manager in pandas.util.testing

+            # the test
+            enc = sys.getdefaultencoding()
+            reload(sys)  # noqa:F821
+            sys.setdefaultencoding('ascii')
+            try:
+                repr(ser)
+                str(ser)
+            finally:
+                # restore encoding
+                sys.setdefaultencoding(enc)

     def test_categorical_repr(self):
         a = Series(Categorical([1, 2, 3, 4]))
         exp = u("0 1\n1 2\n2 3\n3 4\n" +
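The review request to wrap the Python 2 encoding switch in a context manager could look roughly like the sketch below. The name `set_defaultencoding` and the `ValueError` on Python 3 are assumptions made for illustration; nothing in the diff shown here confirms the helper that was eventually added.

```python
import sys
from contextlib import contextmanager


@contextmanager
def set_defaultencoding(encoding):
    """Temporarily change sys.getdefaultencoding() (Python 2 only).

    Hypothetical helper; the name is not taken from the diff above.
    """
    if sys.version_info[0] >= 3:
        # sys.setdefaultencoding does not exist on Python 3.
        raise ValueError("set_defaultencoding is only meaningful on Python 2")
    orig = sys.getdefaultencoding()
    reload(sys)  # noqa: F821  (re-exposes sys.setdefaultencoding on Python 2)
    sys.setdefaultencoding(encoding)
    try:
        yield
    finally:
        # Restore the original encoding even if the body raises.
        sys.setdefaultencoding(orig)
```

With such a helper, the `else` branch of the test would reduce to a `with set_defaultencoding('ascii'):` block containing `repr(ser)` and `str(ser)`, and the manual `try`/`finally` would no longer be needed in the test itself.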