BUG: Fix segfault on dir of a DataFrame with a unicode surrogate character in the column name #32701
Conversation
Thanks for the PR! I'm not sure falling back to the repr is the right approach though. A call to print would raise:

>>> colname = "\ud83d"
>>> print(colname)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed

So I think we should do the same here.
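A minimal standalone sketch of the failure mode under discussion (plain Python, not pandas code): a lone surrogate cannot be encoded as UTF-8, while its repr() is an ordinary printable string.

```python
# Demonstration of why '\ud83d' breaks any code path that encodes to UTF-8.
colname = "\ud83d"  # lone high surrogate: a valid Python str, but invalid in UTF-8

try:
    colname.encode("utf-8")
except UnicodeEncodeError as exc:
    print("encode failed:", exc.reason)  # surrogates not allowed

# repr() escapes the character, yielding a printable ASCII string:
print(repr(colname))                 # prints the escaped form '\ud83d'
print(colname.isprintable())         # False
print(repr(colname).isprintable())   # True
```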
I think this column name should be treated like other non-identifier column names. So this means that it gets ignored in dir().
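A simplified sketch of that idea (assumed, not the actual pandas implementation): only string columns that are valid Python identifiers would be exposed as attributes, so a surrogate name is silently skipped rather than crashing dir().

```python
# Hypothetical filter mirroring the "ignore non-identifier column names" rule.
columns = ["valid_name", "not an identifier", "\ud83d"]

exposed = [c for c in columns if isinstance(c, str) and c.isidentifier()]
print(exposed)  # ['valid_name'] -- the surrogate name is simply left out
```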
can you run benchmarks, and see if anything changes? this call likely impacts most of the asv's
pandas/_libs/tslibs/util.pxd (outdated)

        buf = PyUnicode_AsUTF8AndSize(py_string, length)
        return buf
    if not py_string.isprintable():
shouldn't we be calling the c-func here? this is a heavily used path
Which function did you have in mind? The py_string is not something you can give to a C standard library function, AFAIK. Plus this gets translated to C code calling Python functions anyway.
can you show the before & after generated c-code here.
No longer relevant, I changed the code so that this is no longer there.
(force-pushed 37bbffa to 6ec08ac, then 6ec08ac to bb77b0b)
I have now pushed the check for printable characters up in the failing call chain. This is the only fix that I could think of which did not print any error messages. I have also made a mini bug-reproducing script that just calls the underlying failing pandas call:

    import pandas as pd
    import numpy as np

    a = np.ndarray(shape=(1,), dtype=np.object)
    a.fill('\ud83d')
    table = pd._libs.hashtable.StringHashTable(2)
    print(table.unique(a))
    @@ -192,7 +192,7 @@ def test_categorical_dtype_utf16(all_parsers, csv_dir_path):
         pth = os.path.join(csv_dir_path, "utf16_ex.txt")
         parser = all_parsers
         encoding = "utf-16"
    -    sep = ","
    +    sep = "\t"
why did this change?
This test failed for some reason. When I investigated, I noticed that the separator in this file is a tab rather than a ",". So I changed this and the test passed. I am not sure what caused the failure, but I think the test was wrong to begin with and some weird feature interaction caused it.
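A toy illustration of what happens with the wrong separator (plain Python, no pandas): splitting a tab-separated line on a comma leaves the whole line as a single field, i.e. a single column.

```python
# One line from a hypothetical tab-separated file.
line = "col1\tcol2\tcol3"

print(line.split(","))   # ['col1\tcol2\tcol3'] -- one field: the entire line
print(line.split("\t"))  # ['col1', 'col2', 'col3'] -- three proper columns
```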
Hmm yea very strange. I guess since there isn't a comma in this file at all it just reads every line into a single column....

can you run the algos asv's (e.g. the things that hit unique) and report the results.
You lost me here, what are the algos asv's?
Would be in regards to this: from the asv_bench folder run `asv continuous upstream/master HEAD -b algorithms`
The results show a significant performance decrease for the string cases (a factor of 1.4 to 2.3 worse).
…r in the column name

Return a repr() version if a string is not printable
(force-pushed 7447607 to a52e59c)
With the current patch:
    @@ -12,6 +12,11 @@ WARNING: DO NOT edit .pxi FILE directly, .pxi is generated from .pxi.in
     from pandas._libs.tslibs.util cimport get_c_string
     from pandas._libs.missing cimport C_NA

    +cdef extern from "Python.h":
    +    # Note: importing extern-style allows us to declare these as nogil
This is interacting with Python objects so it needs to be called with the GIL
Will remove it. I am quite new to Cython, so I have to read up a bit on these things.
If you are getting warnings / errors when trying to cythonize, you might need a `with gil:` block in your error handler.
I am not getting any cython errors, so I am probably good.
    @@ -192,7 +192,7 @@ def test_categorical_dtype_utf16(all_parsers, csv_dir_path):
         pth = os.path.join(csv_dir_path, "utf16_ex.txt")
         parser = all_parsers
         encoding = "utf-16"
    -    sep = ","
    +    sep = "\t"
looks good. assume this is not a perf difference.
Another point is that I am not quite sure why the conversion to UTF-8 is needed to compute a hash value. If we can avoid the conversion, the hashing should be even faster.
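A small illustration of that point (plain Python, not the hashtable code): a str can be hashed directly with no encoding pass, while hashing its UTF-8 bytes forces a conversion first, and that conversion is exactly the step that fails for surrogates.

```python
s = "column_name"

h_str = hash(s)                    # hashes the str directly, no encoding step
h_bytes = hash(s.encode("utf-8"))  # must first convert to bytes

print(isinstance(h_str, int), isinstance(h_bytes, int))  # True True

# For a surrogate, the encode step itself is where things break:
try:
    hash("\ud83d".encode("utf-8"))
except UnicodeEncodeError:
    print("the UTF-8 conversion, not the hashing, is the crash site")
```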
Thanks @roberthdevries, another great PR.

If you see a way to improve performance here, a follow-up PR would be welcome.
…acter in the column name (pandas-dev#32701)
Return a repr() version if the column name string is not printable. This also means that the column name is not present in the output of dir().
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff