Skip to content

Resolve Heisenbug in StringHashTable._unique #55530

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed

Conversation

MichaelTiemannOSC
Copy link
Contributor

@MichaelTiemannOSC MichaelTiemannOSC commented Oct 15, 2023

When processing an invalid Unicode string, the exception handler for UnicodeEncodeError called get_c_string with an ephemeral repr value that could be garbage-collected the next time an exception was raised. Issue #45929 demonstrates the problem.

This commit fixes the problem by retaining a Python reference to the repr value that underlies the C string until after all values are processed.

Wisdom from StackOverflow suggests that there's very small performance difference between pre-allocating the array vs. append if indeed we do need to fill it all the way, but because we only need references on exceptions, we expect that in the usual case we will append very few elements, making it faster than pre-allocation.

When processing an invalid Unicode string, the exception handler for UnicodeEncodeError called `get_c_string` with an ephemeral repr value that could be garbage-collected the next time an exception was raised.  Issue pandas-dev#45929 demonstrates the problem.

This commit fixes the problem by retaining a Python reference to the repr value that underlies the C string until after all `values` are processed.

Wisdom from StackOverflow suggests that there's very small performance difference between pre-allocating the array vs. append if indeed we do need to fill it all the way, but because we only need references on exceptions, we expect that in the usual case we will append very few elements, making it faster than pre-allocation.

Signed-off-by: Michael Tiemann <[email protected]>
The `single_cpu` attribute for `test_unique_bad_unicode` was likely an attempt to cover over the underlying bug fixed with this commit.  We can now run this test in the usual fashion.

Added a test case for the problem reported in 45929.

Signed-off-by: Michael Tiemann <[email protected]>
@@ -1128,6 +1128,7 @@ cdef class StringHashTable(HashTable):
use_na_value = na_value is not None

# assign pointers and pre-filter out missing (if ignore_na)
keep_rval_refs = []
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't the same issue happen when the function exits if keep_rval_refs gets GC'ed but the exception message still persists? This may delay the issue from happening but I'm not sure it fully solves it

Copy link
Contributor Author

@MichaelTiemannOSC MichaelTiemannOSC Oct 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is bounded in time between the scanning of values in the first range(n) loop and the finalization of uniques within the function. The exception message does not convey the ephemeral repr value. And uniques consists only of Python objects. When the function returns, only Python objects, not ephemeral C objects, are reachable.

To say this a bit more clearly, we only need the ephemeral repr values to live long enough to become Python objects in the kh_get_str/kh_put_str processing (the second range(n) loop).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't the return value have the same lifecycle issues though? The lifecycle of uniques that is getting returned as well as self.table outlive this function, so once keep_rval_refs gets GC'ed I think you still have the same issue

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not at all, for two reasons. The first reason is that the rvals in question only need to be kept alive because they are what controls whether the ephemeral C strings live or die when the next GC runs. When there is no longer a need to reference those C strings, there's no need to protect the C strings from GC. The second reason is that the thing that keeps the rvals alive is an actual live reference to the rval. That is what keep_rval_refs does. Anything that self.table can reference, anything that uniques can reference is just as live a reference as anything else. So the very fact that self.table or uniques live beyond the scope of this function is all that's needed to keep the Python objects they contain from being reaped. There's nothing special about the contents of keep_rval_refs after the function exits.

The vecs vector contains halflings (C strings contained within Python objects). Once we no longer need to touch the C strings, once we are only dealing with full-fledged Python objects, we can rely on Python's GC to do the right thing with the objects we still need and with the objects we no longer need. It is because we need to touch the strings within ephemeral repr objects that we have to keep those objects alive as long as the strings need to live. Not longer, and not because somebody else also might need the Python objects (because that need itself will keep the object alive).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The solution here is to figure out how to use the CPython API and not introduce any new lifecycle management. I am guessing that PyUnicode_AsUTF8AndSize is throwing an error on the surrogate pair, since that is what happens in the interpreter:

>>> '1 \udcd6a NY'.encode("utf8")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcd6' in position 2: surrogates not allowed

Is there a way to handle that surrogate better in the C-API? (we can get rid of get_c_string in the first place; I think this was just left around as a Py2/3 compat function but is equivaelent to PyUnicode_AsUTF8AndSize)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that the initial implementation touches a live wire when it reaches into Python to touch its C strings. All I did was create the industrial-strength insulating gloves to make that safe. Alas, what I don't know about Unicode could fill several books, so I would defer to somebody who has that knowledge. I just want to get my complex number tests integrated 🤓

@WillAyd WillAyd mentioned this pull request Oct 15, 2023
5 tasks
@mroeschke mroeschke added the Strings String extension data type and string data label Oct 16, 2023
Copy link
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Nov 16, 2023
@MichaelTiemannOSC
Copy link
Contributor Author

Given the new Cython 3.0 changes about to be merged, are there any C-API folks who might know how to take this across the finish line?

Correctly alphabetize items in `Other` list.

Signed-off-by: Michael Tiemann <[email protected]>
@mroeschke
Copy link
Member

Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Stale Strings String extension data type and string data
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: pandas.Series.unique() does not return correct unique values on \u string
3 participants