BUG: pandas.Series.unique() does not return correct unique values on non-UTF-8-encodable strings #58215
Conversation
When processing an invalid Unicode string, the exception handler for UnicodeEncodeError called `get_c_string` with an ephemeral repr value that could be garbage-collected the next time an exception was raised. Issue pandas-dev#45929 demonstrates the problem. This commit fixes the problem by retaining a Python reference to the repr value that underlies the C string until after all `values` are processed.

Wisdom from Stack Overflow suggests there's a very small performance difference between pre-allocating the array and appending, if we do indeed need to fill it all the way; but because we only keep references on exceptions, we expect that in the usual case we will append very few elements, making append faster than pre-allocation.

Signed-off-by: Michael Tiemann <[email protected]>
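As an illustration (not code from the patch itself): a lone surrogate is a valid Python `str` that raises UnicodeEncodeError under plain UTF-8 encoding, which is what routes such values through the repr fallback described above:

```python
# A lone high surrogate is a legal Python str but not UTF-8 encodable.
s = "\ud83d"

try:
    s.encode("utf-8")
    raised = False
except UnicodeEncodeError:
    raised = True

# The repr of the string is ASCII-safe, so it *can* be encoded.
# This is the kind of ephemeral fallback object the fix keeps a reference to.
fallback = repr(s)
print(raised, fallback.encode("utf-8"))
```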
The `single_cpu` attribute for `test_unique_bad_unicode` was likely an attempt to paper over the underlying bug fixed with this commit. We can now run this test in the usual fashion. Added a test case for the problem reported in #45929. Signed-off-by: Michael Tiemann <[email protected]>
Correctly alphabetize items in `Other` list. Signed-off-by: Michael Tiemann <[email protected]>
@@ -1179,6 +1179,8 @@ cdef class StringHashTable(HashTable):
         use_na_value = na_value is not None

         # assign pointers and pre-filter out missing (if ignore_na)
+        # https://cython.readthedocs.io/en/latest/src/userguide/language_basics.html#caveats-when-using-a-python-string-in-a-c-context
+        keep_bad_unicode_refs = []
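In plain Python, the lifetime-extension pattern this added list enables looks roughly like the following (a sketch, not the actual Cython code; `keep_refs` plays the role of `keep_bad_unicode_refs`):

```python
# Sketch: keep a Python reference to each fallback object so that
# buffers/pointers derived from it stay valid until all values are processed.
keep_refs = []   # stands in for keep_bad_unicode_refs
buffers = []

for val in ["ok", "\ud83d"]:       # second value is not UTF-8 encodable
    try:
        buf = val.encode("utf-8")
    except UnicodeEncodeError:
        fallback = repr(val)        # encodable stand-in for the bad string
        keep_refs.append(fallback)  # extend its lifetime past this iteration
        buf = fallback.encode("utf-8")
    buffers.append(buf)

print(buffers, len(keep_refs))
```

Only the exceptional values land in the list, which is why appending beats pre-allocating in the common case.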
This doesn't fix the underlying issue. It may make it less likely, but the lifecycle is still not properly controlled.

I think what needs to be done is something akin to:

`'1 \udcd6a NY'.encode('utf8', errors="surrogatepass").decode('utf8', errors="surrogatepass")`

I think you can just do that directly in Cython, but we can look at the underlying C API if needed.

(Side note: I would really love to get rid of `get_c_string`; it's a layer of indirection that adds no value.)
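The suggested round-trip does preserve lone surrogates; a minimal check:

```python
s = '1 \udcd6a NY'   # contains a lone low surrogate

# surrogatepass lets the surrogate through the UTF-8 codec in both directions
b = s.encode('utf8', errors="surrogatepass")
roundtrip = b.decode('utf8', errors="surrogatepass")
assert roundtrip == s

# without the error handler, encoding fails
try:
    s.encode('utf8')
except UnicodeEncodeError:
    print("plain utf8 encode raises UnicodeEncodeError")
```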
I was able to replace `get_c_string` with the associated C APIs, but if I understand https://cython.readthedocs.io/en/latest/src/userguide/language_basics.html#caveats-when-using-a-python-string-in-a-c-context correctly, since we're storing char pointers from these encoded Python strings and using them later, we need to keep a reference to those Python strings.
Oh OK, I see the difference - CPython can cache the UTF-8 bytes of a string alongside the string object. When the string is not UTF-8 encodable and we have to create temporary objects, we run into trouble. So the list here artificially extends the lifetime of new objects guaranteed to be UTF-8 encodable, to rely on that caching mechanism.

To be honest, I would be -1 on this change and would rather call it a wontfix. It is a very niche issue that fights the internals of CPython (and for that matter pyarrow, whose strings are UTF-8).

If someone wanted mixed-encoding Python strings like this, I think `pa.binary()` is a better data type choice.
OK, I'll close this PR then. My main motivation was to make `test_unique_bad_unicode` not flaky due to this issue, but I'll open a follow-up PR making this test permanently `xfail(strict=True)` instead.
I think the name `test_unique_bad_unicode` is a misnomer. The code point `"\ud83d"` exists in Unicode; it is a high surrogate:
https://www.unicode.org/charts/PDF/UD800.pdf
The problem is that by itself that high surrogate doesn't mean anything (it would need to be paired with a low surrogate). As such, it doesn't represent any glyph in any encoding.
AFAIU if you wanted to keep that Unicode code point, you would have to:

1. Convert the Python Unicode object to bytes via `str.encode(<encoding>, errors="surrogatepass")`
2. Run your algorithms against the surrogatepass bytes
3. Convert your surrogatepassed bytes back to a Unicode string via `bytes.decode(<encoding>, errors="surrogatepass")`

I realize step 2 may not exist today, but it is a good impetus to work on interop with `pa.binary()` if this is required.
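The three-step workflow can be sketched end-to-end in plain Python, with a simple order-preserving deduplication standing in for step 2 (a hypothetical placeholder, not a pandas algorithm):

```python
values = ["a\ud83db", "a\ud83db", "plain"]   # duplicates containing a lone surrogate

# Step 1: encode with surrogatepass so the lone surrogate survives
encoded = [v.encode("utf-8", errors="surrogatepass") for v in values]

# Step 2: run the algorithm over bytes (here: order-preserving unique)
seen, uniq = set(), []
for b in encoded:
    if b not in seen:
        seen.add(b)
        uniq.append(b)

# Step 3: decode back, again with surrogatepass
result = [b.decode("utf-8", errors="surrogatepass") for b in uniq]
print(result == ["a\ud83db", "plain"])
```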
Revival of #55530