Dir fails on dataframes with pathological column names #25509
Comments
On OSX this segfaults |
What is the expectation here? Is this the first half of a surrogate pair? |
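For context on the question above, here is a minimal sketch of the UTF-16 surrogate arithmetic. It suggests that U+D83D is a high (leading) surrogate, which on its own does not encode any character. The helper name `split_surrogates` and the choice of U+1F600 as the sample character are illustrative assumptions, not something taken from the report.

```python
def split_surrogates(ch):
    """Return the (high, low) UTF-16 surrogate code points for a single
    character outside the Basic Multilingual Plane (hypothetical helper)."""
    cp = ord(ch)
    if cp < 0x10000:
        raise ValueError("character is inside the BMP and has no surrogate pair")
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits select the high surrogate
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits select the low surrogate
    return high, low


# U+1F600 is an arbitrary sample emoji chosen for illustration.
high, low = split_surrogates("\U0001F600")

# High surrogates occupy U+D800..U+DBFF, low surrogates U+DC00..U+DFFF.
is_high_surrogate = 0xD800 <= ord("\ud83d") <= 0xDBFF
```

Under these assumptions, `high` is `0xD83D`, the same code point as the column name in the report, so the string is "half" of a pair with no trailing low surrogate.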
Tracking this down, it looks like we get to tslibs.util.get_c_string_buf_and_size and within that we call |
So with regards to the OP I don't think this is a bug with pandas - an exception gets thrown when passing it as an argument to
>>> print('\ud83d')
>>> type('\ud83d')
<class 'str'>
>>> alist = ['\ud83d']
>>> alist[0] # surprised this works
'\ud83d'
>>> print(alist[0]) # this failure matches pandas
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed
w.r.t. Cython I see the following warnings before segfault, so maybe something of interest there:
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed
Exception ignored in: 'pandas._libs.tslibs.util.get_c_string_buf_and_size'
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed
[1] 32913 segmentation fault python |
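The failure in the session above can be reproduced without printing anything, which avoids depending on the terminal encoding. This is a small sketch in plain Python, not pandas code. The helper name `utf8_encodable` is made up for illustration.

```python
def utf8_encodable(s):
    """Return True if the string can be encoded as strict UTF-8
    (hypothetical helper, not part of pandas)."""
    try:
        s.encode("utf-8")
    except UnicodeEncodeError:
        return False
    return True


name = "\ud83d"  # the lone surrogate from the report

strict_ok = utf8_encodable(name)

# The standard "surrogatepass" error handler encodes the lone surrogate
# anyway. The resulting bytes are not valid UTF-8 for interchange.
raw = name.encode("utf-8", "surrogatepass")

# ascii() escapes non-ASCII characters, so it is safe to display.
safe_repr = ascii(name)
```

Here the strict encode is the step that raises `UnicodeEncodeError`, which matches the message shown in the traceback and in the Cython warnings.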
Doing some googling, is there a specific range of unicode characters that are surrogates that we might be able to screen for? Are there non-surrogate "pathological" cases we need to worry about? |
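On the first question: Unicode reserves the code points U+D800 through U+DFFF for surrogates, so a range check is enough to screen for them. The sketch below shows one way to do that for a list of labels. The names `has_surrogate` and `labels` are hypothetical, and this is not how pandas handles the case.

```python
def has_surrogate(label):
    """Return True if a str label contains any code point in the
    surrogate range U+D800..U+DFFF (hypothetical helper)."""
    if not isinstance(label, str):
        return False  # non-string labels (ints, tuples, ...) are skipped
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in label)


# Sample labels chosen for illustration only.
labels = ["a", "\ud83d", "caf\u00e9", "\U0001F600", 0]

flagged = [label for label in labels if has_surrogate(label)]
```

A complete emoji such as U+1F600 is a single code point in a Python 3 `str`, so it is not flagged. Only the unpaired surrogate is.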
In
to
then a) the Not sure what to do with this information, but it's out there. |
I have traced this down to This is where the NULL pointer gets assigned to an array value in
# if ignore_na is False, we also stringify NaN/None/etc.
v = get_c_string(<str>val)
vecs[i] = v |
Code Sample, a copy-pastable example if possible
Problem description
Dir fails on dataframes with pathological column names
Output of
pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Darwin
OS-release: 18.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.24.1
pytest: 3.10.1
pip: 18.1
setuptools: 40.6.2
Cython: None
numpy: 1.15.4
scipy: 1.1.0
pyarrow: 0.11.1
xarray: 0.11.3
IPython: 7.2.0
sphinx: 1.8.4
patsy: None
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: 0.2.0
fastparquet: 0.1.6
pandas_gbq: None
pandas_datareader: None
gcsfs: None