
Commit c300b46

Resolve Heisenbug in StringHashTable._unique
When processing an invalid Unicode string, the exception handler for UnicodeEncodeError called `get_c_string` with an ephemeral repr value that could be garbage-collected the next time an exception was raised. Issue pandas-dev#45929 demonstrates the problem.

This commit fixes the problem by retaining a Python reference to the repr value that underlies the C string until after all `values` are processed.

Wisdom from StackOverflow suggests that there is very little performance difference between pre-allocating the array and appending to it when the array must be filled completely. Because we only need references when an exception occurs, we expect to append very few elements in the usual case, making appending faster than pre-allocation.

Signed-off-by: Michael Tiemann <[email protected]>
1 parent 10cf330 commit c300b46

2 files changed, +5 -1 lines changed

doc/source/whatsnew/v2.2.0.rst

+1
@@ -403,6 +403,7 @@ Other
 ^^^^^
 - Bug in :func:`cut` incorrectly allowing cutting of timezone-aware datetimes with timezone-naive bins (:issue:`54964`)
 - Bug in :meth:`DataFrame.apply` where passing ``raw=True`` ignored ``args`` passed to the applied function (:issue:`55009`)
+- Bug in Cython :meth:`StringHashTable._unique` used ephemeral repr values when UnicodeEncodeError was raised (:issue:`45929`)
 - Bug in rendering ``inf`` values inside a a :class:`DataFrame` with the ``use_inf_as_na`` option enabled (:issue:`55483`)
 - Bug in rendering a :class:`Series` with a :class:`MultiIndex` when one of the index level's names is 0 not having that name displayed (:issue:`55415`)
 -

pandas/_libs/hashtable_class_helper.pxi.in

+4 -1
@@ -1128,6 +1128,7 @@ cdef class StringHashTable(HashTable):
         use_na_value = na_value is not None

         # assign pointers and pre-filter out missing (if ignore_na)
+        keep_rval_refs = []
         vecs = <const char **>malloc(n * sizeof(char *))
         for i in range(n):
             val = values[i]
@@ -1144,7 +1145,9 @@ cdef class StringHashTable(HashTable):
                 try:
                     v = get_c_string(<str>val)
                 except UnicodeEncodeError:
-                    v = get_c_string(<str>repr(val))
+                    rval = <str>repr(val)
+                    keep_rval_refs.append(rval)
+                    v = get_c_string(rval)
                 vecs[i] = v

         # compute
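
The fallback path in the second hunk can be sketched in plain Python. This is only an analogue, not the Cython code: `encode_all` is a hypothetical helper, and `str.encode("utf-8")` stands in for `get_c_string`. Unlike a borrowed `char *`, a Python `bytes` object owns its data, so the sketch shows the control flow and which values end up in `keep_rval_refs`, not the dangling-pointer behaviour itself.

```python
def encode_all(values):
    """Hypothetical pure-Python analogue of the patched loop."""
    keep_rval_refs = []  # mirrors the list added by the commit
    encoded = []
    for val in values:
        try:
            # Stand-in for get_c_string(<str>val); fails on lone surrogates.
            data = val.encode("utf-8")
        except UnicodeEncodeError:
            # repr() escapes the offending code point, so it encodes cleanly.
            rval = repr(val)
            # Hold a reference so the fallback string outlives this iteration.
            keep_rval_refs.append(rval)
            data = rval.encode("utf-8")
        encoded.append(data)
    return encoded, keep_rval_refs


encoded, refs = encode_all(["abc", "\ud800"])
print(encoded)
print(refs)
```

Only the string containing the lone surrogate takes the fallback path, so the reference list stays short for well-formed input. That is the case the commit message expects to be typical.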
