-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
Resolve Heisenbug in StringHashTable._unique #55530
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
When processing an invalid Unicode string, the exception handler for UnicodeEncodeError called `get_c_string` with an ephemeral repr value that could be garbage-collected the next time an exception was raised. Issue pandas-dev#45929 demonstrates the problem. This commit fixes the problem by retaining a Python reference to the repr value that underlies the C string until after all `values` are processed. Wisdom from StackOverflow suggests that there's very small performance difference between pre-allocating the array vs. append if indeed we do need to fill it all the way, but because we only need references on exceptions, we expect that in the usual case we will append very few elements, making it faster than pre-allocation. Signed-off-by: Michael Tiemann <[email protected]>
The `single_cpu` attribute for `test_unique_bad_unicode` was likely an attempt to cover over the underlying bug fixed with this commit. We can now run this test in the usual fashion. Added a test case for the problem reported in 45929. Signed-off-by: Michael Tiemann <[email protected]>
@@ -1128,6 +1128,7 @@ cdef class StringHashTable(HashTable): | |||
use_na_value = na_value is not None | |||
|
|||
# assign pointers and pre-filter out missing (if ignore_na) | |||
keep_rval_refs = [] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't the same issue happen when the function exits if keep_rval_refs
gets GC'ed but the exception message still persists? This may delay the issue from happening but I'm not sure it fully solves it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is bounded in time between the scanning of values
in the first range(n)
loop and the finalization of uniques
within the function. The exception message does not convey the ephemeral repr value. And uniques
consists only of Python objects. When the function returns, only Python objects, not ephemeral C objects, are reachable.
To say this a bit more clearly, we only need the ephemeral repr values to live long enough to become Python objects in the kh_get_str/kh_put_str
processing (the second range(n)
loop).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't the return value have the same lifecycle issues though? The lifecycle of uniques
that is getting returned as well as self.table
outlive this function, so once keep_rval_refs
gets GC'ed I think you still have the same issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not at all, for two reasons. The first reason is that the rvals in question only need to be kept alive because they are what controls whether the ephemeral C strings live or die when the next GC runs. When there is no longer a need to reference those C strings, there's no need to protect the C strings from GC. The second reason is that the thing that keeps the rvals alive is an actual live reference to the rval. That is what keep_rval_refs
does. Anything that self.table
can reference, anything that uniques
can reference is just as live a reference as anything else. So the very fact that self.table
or uniques
live beyond the scope of this function is all that's needed to keep the Python objects they contain from being reaped. There's nothing special about the contents of keep_rval_refs
after the function exits.
The vecs
vector contains halflings (C strings contained within Python objects). Once we no longer need to touch the C strings, once we are only dealing with full-fledged Python objects, we can rely on Python's GC to do the right thing with the objects we still need and with the objects we no longer need. It is because we need to touch the strings within ephemeral repr objects that we have to keep those objects alive as long as the strings need to live. Not longer, and not because somebody else also might need the Python objects (because that need itself will keep the object alive).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The solution here is to figure out how to use the CPython API and not introduce any new lifecycle management. I am guessing that PyUnicode_AsUTF8AndSize
is throwing an error on the surrogate pair, since that is what happens in the interpreter:
>>> '1 \udcd6a NY'.encode("utf8")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\udcd6' in position 2: surrogates not allowed
Is there a way to handle that surrogate better in the C-API? (we can get rid of get_c_string
in the first place; I think this was just left around as a Py2/3 compat function but is equivaelent to PyUnicode_AsUTF8AndSize
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that the initial implementation touches a live wire when it reaches into Python to touch its C strings. All I did was create the industrial-strength insulating gloves to make that safe. Alas, what I don't know about Unicode could fill several books, so I would defer to somebody who has that knowledge. I just want to get my complex number tests integrated 🤓
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this. |
Given the new Cython 3.0 changes about to be merged, are there any C-API folks who might know how to take this across the finish line? |
Correctly alphabetize items in `Other` list. Signed-off-by: Michael Tiemann <[email protected]>
Thanks for the pull request, but it appears to have gone stale. If interested in continuing, please merge in the main branch, address any review comments and/or failing tests, and we can reopen. |
When processing an invalid Unicode string, the exception handler for UnicodeEncodeError called
get_c_string
with an ephemeral repr value that could be garbage-collected the next time an exception was raised. Issue #45929 demonstrates the problem.This commit fixes the problem by retaining a Python reference to the repr value that underlies the C string until after all
values
are processed.Wisdom from StackOverflow suggests that there's very small performance difference between pre-allocating the array vs. append if indeed we do need to fill it all the way, but because we only need references on exceptions, we expect that in the usual case we will append very few elements, making it faster than pre-allocation.
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.