Resolve Heisenbug in StringHashTable._unique #55530

MichaelTiemannOSC · 2023-10-15T12:18:07Z

When processing an invalid Unicode string, the exception handler for UnicodeEncodeError called get_c_string with an ephemeral repr value that could be garbage-collected the next time an exception was raised. Issue #45929 demonstrates the problem.

This commit fixes the problem by retaining a Python reference to the repr value that underlies the C string until after all values are processed.

Wisdom from StackOverflow suggests that there's very small performance difference between pre-allocating the array vs. append if indeed we do need to fill it all the way, but because we only need references on exceptions, we expect that in the usual case we will append very few elements, making it faster than pre-allocation.

closes BUG: pandas.Series.unique() does not return correct unique values on \u string #45929
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

When processing an invalid Unicode string, the exception handler for UnicodeEncodeError called `get_c_string` with an ephemeral repr value that could be garbage-collected the next time an exception was raised. Issue pandas-dev#45929 demonstrates the problem. This commit fixes the problem by retaining a Python reference to the repr value that underlies the C string until after all `values` are processed. Wisdom from StackOverflow suggests that there's very small performance difference between pre-allocating the array vs. append if indeed we do need to fill it all the way, but because we only need references on exceptions, we expect that in the usual case we will append very few elements, making it faster than pre-allocation. Signed-off-by: Michael Tiemann <[email protected]>

The `single_cpu` attribute for `test_unique_bad_unicode` was likely an attempt to cover over the underlying bug fixed with this commit. We can now run this test in the usual fashion. Added a test case for the problem reported in 45929. Signed-off-by: Michael Tiemann <[email protected]>

WillAyd · 2023-10-15T14:24:59Z

pandas/_libs/hashtable_class_helper.pxi.in

@@ -1128,6 +1128,7 @@ cdef class StringHashTable(HashTable):
        use_na_value = na_value is not None

        # assign pointers and pre-filter out missing (if ignore_na)
+        keep_rval_refs = []


Wouldn't the same issue happen when the function exits if keep_rval_refs gets GC'ed but the exception message still persists? This may delay the issue from happening but I'm not sure it fully solves it

The problem is bounded in time between the scanning of values in the first range(n) loop and the finalization of uniques within the function. The exception message does not convey the ephemeral repr value. And uniques consists only of Python objects. When the function returns, only Python objects, not ephemeral C objects, are reachable.

To say this a bit more clearly, we only need the ephemeral repr values to live long enough to become Python objects in the kh_get_str/kh_put_str processing (the second range(n) loop).

Doesn't the return value have the same lifecycle issues though? The lifecycle of uniques that is getting returned as well as self.table outlive this function, so once keep_rval_refs gets GC'ed I think you still have the same issue

Not at all, for two reasons. The first reason is that the rvals in question only need to be kept alive because they are what controls whether the ephemeral C strings live or die when the next GC runs. When there is no longer a need to reference those C strings, there's no need to protect the C strings from GC. The second reason is that the thing that keeps the rvals alive is an actual live reference to the rval. That is what keep_rval_refs does. Anything that self.table can reference, anything that uniques can reference is just as live a reference as anything else. So the very fact that self.table or uniques live beyond the scope of this function is all that's needed to keep the Python objects they contain from being reaped. There's nothing special about the contents of keep_rval_refs after the function exits.

The vecs vector contains halflings (C strings contained within Python objects). Once we no longer need to touch the C strings, once we are only dealing with full-fledged Python objects, we can rely on Python's GC to do the right thing with the objects we still need and with the objects we no longer need. It is because we need to touch the strings within ephemeral repr objects that we have to keep those objects alive as long as the strings need to live. Not longer, and not because somebody else also might need the Python objects (because that need itself will keep the object alive).

The solution here is to figure out how to use the CPython API and not introduce any new lifecycle management. I am guessing that PyUnicode_AsUTF8AndSize is throwing an error on the surrogate pair, since that is what happens in the interpreter:

>>> '1 \udcd6a NY'.encode("utf8") Traceback (most recent call last): File "<stdin>", line 1, in <module> UnicodeEncodeError: 'utf-8' codec can't encode character '\udcd6' in position 2: surrogates not allowed

Is there a way to handle that surrogate better in the C-API? (we can get rid of get_c_string in the first place; I think this was just left around as a Py2/3 compat function but is equivaelent to PyUnicode_AsUTF8AndSize)

I agree that the initial implementation touches a live wire when it reaches into Python to touch its C strings. All I did was create the industrial-strength insulating gloves to make that safe. Alas, what I don't know about Unicode could fill several books, so I would defer to somebody who has that knowledge. I just want to get my complex number tests integrated 🤓

github-actions · 2023-11-16T00:05:31Z

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

MichaelTiemannOSC · 2023-11-16T01:53:17Z

Given the new Cython 3.0 changes about to be merged, are there any C-API folks who might know how to take this across the finish line?

Correctly alphabetize items in `Other` list. Signed-off-by: Michael Tiemann <[email protected]>

mroeschke · 2023-12-09T19:36:34Z

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

MichaelTiemannOSC requested a review from WillAyd as a code owner October 15, 2023 12:18

WillAyd reviewed Oct 15, 2023

View reviewed changes

WillAyd mentioned this pull request Oct 15, 2023

Enable many complex number tests #54761

Closed

5 tasks

mroeschke added the Strings String extension data type and string data label Oct 16, 2023

github-actions bot added the Stale label Nov 16, 2023

Merge branch 'main' into issue_45929

fd9f388

Update v2.2.0.rst

f33cdc0

Correctly alphabetize items in `Other` list. Signed-off-by: Michael Tiemann <[email protected]>

mroeschke closed this Dec 9, 2023

mroeschke mentioned this pull request Apr 10, 2024

BUG: pandas.Series.unique() does not return correct unique values on non UTF8 enodeable strings #58215

Closed

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Resolve Heisenbug in StringHashTable._unique #55530

Resolve Heisenbug in StringHashTable._unique #55530

MichaelTiemannOSC commented Oct 15, 2023 •

edited

Loading

WillAyd Oct 15, 2023

MichaelTiemannOSC Oct 15, 2023 •

edited

Loading

WillAyd Oct 15, 2023

MichaelTiemannOSC Oct 15, 2023

WillAyd Oct 15, 2023

MichaelTiemannOSC Oct 15, 2023

github-actions bot commented Nov 16, 2023

MichaelTiemannOSC commented Nov 16, 2023

mroeschke commented Dec 9, 2023

Resolve Heisenbug in StringHashTable._unique #55530

Resolve Heisenbug in StringHashTable._unique #55530

Conversation

MichaelTiemannOSC commented Oct 15, 2023 • edited Loading

WillAyd Oct 15, 2023

Choose a reason for hiding this comment

MichaelTiemannOSC Oct 15, 2023 • edited Loading

Choose a reason for hiding this comment

WillAyd Oct 15, 2023

Choose a reason for hiding this comment

MichaelTiemannOSC Oct 15, 2023

Choose a reason for hiding this comment

WillAyd Oct 15, 2023

Choose a reason for hiding this comment

MichaelTiemannOSC Oct 15, 2023

Choose a reason for hiding this comment

github-actions bot commented Nov 16, 2023

MichaelTiemannOSC commented Nov 16, 2023

mroeschke commented Dec 9, 2023

MichaelTiemannOSC commented Oct 15, 2023 •

edited

Loading

MichaelTiemannOSC Oct 15, 2023 •

edited

Loading