Skip to content

Resolve Heisenbug in StringHashTable._unique #55530

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Closed
Show file tree
Hide file tree
Changes from 2 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions doc/source/whatsnew/v2.2.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -403,6 +403,7 @@ Other
^^^^^
- Bug in :func:`cut` incorrectly allowing cutting of timezone-aware datetimes with timezone-naive bins (:issue:`54964`)
- Bug in :meth:`DataFrame.apply` where passing ``raw=True`` ignored ``args`` passed to the applied function (:issue:`55009`)
- Bug in Cython :meth:`StringHashTable._unique` used ephemeral repr values when UnicodeEncodeError was raised (:issue:`45929`)
- Bug in rendering ``inf`` values inside a a :class:`DataFrame` with the ``use_inf_as_na`` option enabled (:issue:`55483`)
- Bug in rendering a :class:`Series` with a :class:`MultiIndex` when one of the index level's names is 0 not having that name displayed (:issue:`55415`)
-
Expand Down
5 changes: 4 additions & 1 deletion pandas/_libs/hashtable_class_helper.pxi.in
Original file line number Diff line number Diff line change
Expand Up @@ -1128,6 +1128,7 @@ cdef class StringHashTable(HashTable):
use_na_value = na_value is not None

# assign pointers and pre-filter out missing (if ignore_na)
keep_rval_refs = []
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't the same issue happen when the function exits if keep_rval_refs gets GC'ed but the exception message still persists? This may delay the issue from happening but I'm not sure it fully solves it

Copy link
Contributor Author

@MichaelTiemannOSC MichaelTiemannOSC Oct 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is bounded in time between the scanning of values in the first range(n) loop and the finalization of uniques within the function. The exception message does not convey the ephemeral repr value. And uniques consists only of Python objects. When the function returns, only Python objects, not ephemeral C objects, are reachable.

To say this a bit more clearly, we only need the ephemeral repr values to live long enough to become Python objects in the kh_get_str/kh_put_str processing (the second range(n) loop).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't the return value have the same lifecycle issues though? The lifecycle of uniques that is getting returned as well as self.table outlive this function, so once keep_rval_refs gets GC'ed I think you still have the same issue

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not at all, for two reasons. The first reason is that the rvals in question only need to be kept alive because they are what controls whether the ephemeral C strings live or die when the next GC runs. When there is no longer a need to reference those C strings, there's no need to protect the C strings from GC. The second reason is that the thing that keeps the rvals alive is an actual live reference to the rval. That is what keep_rval_refs does. Anything that self.table can reference, anything that uniques can reference is just as live a reference as anything else. So the very fact that self.table or uniques live beyond the scope of this function is all that's needed to keep the Python objects they contain from being reaped. There's nothing special about the contents of keep_rval_refs after the function exits.

The vecs vector contains halflings (C strings contained within Python objects). Once we no longer need to touch the C strings, once we are only dealing with full-fledged Python objects, we can rely on Python's GC to do the right thing with the objects we still need and with the objects we no longer need. It is because we need to touch the strings within ephemeral repr objects that we have to keep those objects alive as long as the strings need to live. Not longer, and not because somebody else also might need the Python objects (because that need itself will keep the object alive).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The solution here is to figure out how to use the CPython API and not introduce any new lifecycle management. I am guessing that PyUnicode_AsUTF8AndSize is throwing an error on the surrogate pair, since that is what happens in the interpreter:

>>> '1 \udcd6a NY'.encode("utf8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcd6' in position 2: surrogates not allowed

Is there a way to handle that surrogate better in the C-API? (we can get rid of get_c_string in the first place; I think this was just left around as a Py2/3 compat function but is equivaelent to PyUnicode_AsUTF8AndSize)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that the initial implementation touches a live wire when it reaches into Python to touch its C strings. All I did was create the industrial-strength insulating gloves to make that safe. Alas, what I don't know about Unicode could fill several books, so I would defer to somebody who has that knowledge. I just want to get my complex number tests integrated 🤓

vecs = <const char **>malloc(n * sizeof(char *))
for i in range(n):
val = values[i]
Expand All @@ -1144,7 +1145,9 @@ cdef class StringHashTable(HashTable):
try:
v = get_c_string(<str>val)
except UnicodeEncodeError:
v = get_c_string(<str>repr(val))
rval = <str>repr(val)
keep_rval_refs.append(rval)
v = get_c_string(rval)
vecs[i] = v

# compute
Expand Down
19 changes: 18 additions & 1 deletion pandas/tests/base/test_unique.py
Original file line number Diff line number Diff line change
Expand Up @@ -97,7 +97,6 @@ def test_nunique_null(null_obj, index_or_series_obj):
assert obj.nunique(dropna=False) == max(0, num_unique_values)


@pytest.mark.single_cpu
def test_unique_bad_unicode(index_or_series):
# regression test for #34550
uval = "\ud83d" # smiley emoji
Expand All @@ -113,6 +112,24 @@ def test_unique_bad_unicode(index_or_series):
tm.assert_numpy_array_equal(result, expected)


def test_unique_45929(index_or_series):
# regression test for #45929
data_list = [
"1 \udcd6a NY",
"2 \udcd6b NY",
"3 \ud800c NY",
"4 \udcd6d NY",
"5 \udcc3e NY",
]

obj = index_or_series(data_list)
assert len(obj.unique()) == len(data_list)
assert len(obj.value_counts()) == len(data_list)
assert len(np.unique(data_list)) == len(data_list)
assert len(set(data_list)) == len(data_list)
assert obj.is_unique


@pytest.mark.parametrize("dropna", [True, False])
def test_nunique_dropna(dropna):
# GH37566
Expand Down