Skip to content

BUG: Fix segfault on dir of a DataFrame with a unicode surrogate character in the column name #32701

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

roberthdevries
Copy link
Contributor

Return a repr() version if the column name string is not printable. This also means the the column name is not present in the output of dir()

@roberthdevries roberthdevries changed the title Fix segfault on dir of a DataFrame with an unicode surrogate character in the column name Fix segfault on dir of a DataFrame with a unicode surrogate character in the column name Mar 14, 2020
@WillAyd
Copy link
Member

WillAyd commented Mar 14, 2020

Thanks for the PR! I'm not sure falling back to the repr is the right approach though. A call to print would raise:

>>> colname = "\ud83d"
>>> print(colname)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed

So I think should do the same here

@roberthdevries
Copy link
Contributor Author

I think this column name should be treated like other non-identifier column names. So this means that it gets ignored in dir(), but you can see it when you print the data frame.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you run benchmarks, and see if anything changes, this call likeley impacts most of the asv's


buf = PyUnicode_AsUTF8AndSize(py_string, length)
return buf
if not py_string.isprintable():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't we be calling the c-func here? this is a heavily used path

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which function did you have in mind? The py_string is not something you can give to a C standard library function AFAIK. Plus this gets translated to C code calling python functions anyway.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you show the before & after generated c-code here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No longer relevant, I changed the code so that this is no longer there.

@jreback jreback added Output-Formatting __repr__ of pandas objects, to_string Unicode Unicode strings labels Mar 14, 2020
@roberthdevries roberthdevries force-pushed the fix-25509-dir-failure-on-df-with-unicode-surrogates branch from 37bbffa to 6ec08ac Compare March 14, 2020 22:28
@roberthdevries roberthdevries requested a review from jreback March 14, 2020 22:44
@roberthdevries roberthdevries force-pushed the fix-25509-dir-failure-on-df-with-unicode-surrogates branch from 6ec08ac to bb77b0b Compare March 15, 2020 07:22
@roberthdevries
Copy link
Contributor Author

I have now pushed the check for printable characters up in the failing call chain. This is the only fix that I could think of which did not print any error messages.

I have also made a mini bug reproducing script that just calls the underlying failing pandas call:

import pandas as pd
import numpy as np

a = np.ndarray(shape=(1,), dtype=np.object)
a.fill('\ud83d')
table = pd._libs.hashtable.StringHashTable(2)
print(table.unique(a))

@@ -192,7 +192,7 @@ def test_categorical_dtype_utf16(all_parsers, csv_dir_path):
pth = os.path.join(csv_dir_path, "utf16_ex.txt")
parser = all_parsers
encoding = "utf-16"
sep = ","
sep = "\t"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why did this change?

Copy link
Contributor Author

@roberthdevries roberthdevries Mar 17, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test failed for some reason. So when I investigated this I noticed that the separator in this file is a tab rather than a ,. So I changed this and the test passed. I am not sure what caused the failure, but I think that the test was wrong to begin with and there is some weird feature interaction that caused the failure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm yea very strange. I guess since there isn't a comma in this file at all it just reads every line into a single column....

@jreback
Copy link
Contributor

jreback commented Mar 16, 2020

can you run the algos asv's (e.g. the things that hit unique) and report the results.

@roberthdevries roberthdevries requested a review from jreback March 17, 2020 20:34
@roberthdevries
Copy link
Contributor Author

You lost me here, what are the algos asv's?

@WillAyd
Copy link
Member

WillAyd commented Mar 18, 2020

Would be in regards to this:

https://pandas.pydata.org/pandas-docs/stable/development/contributing.html#running-the-performance-test-suite

so from asv_bench folder run

asv continuous upstream/master HEAD -b algorithms

@roberthdevries
Copy link
Contributor Author

roberthdevries commented Mar 18, 2020

The results are a significant performance decrease for the string cases. (factor 2.3 to 1.4 worse)
I guess I have to figure out to reset the python error when the conversion fails.

@roberthdevries roberthdevries force-pushed the fix-25509-dir-failure-on-df-with-unicode-surrogates branch from 7447607 to a52e59c Compare March 18, 2020 20:47
@roberthdevries
Copy link
Contributor Author

With the current patch:

       before           after         ratio
     [e72e2dd1]       [a52e59cd]
     <master>         <fix-25509-dir-failure-on-df-with-unicode-surrogates>
-      5.83±0.8μs      4.84±0.02μs     0.83  algorithms.Duplicated.time_duplicated(True, 'first', 'string')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

@roberthdevries roberthdevries changed the title Fix segfault on dir of a DataFrame with a unicode surrogate character in the column name BUG: Fix segfault on dir of a DataFrame with a unicode surrogate character in the column name Mar 18, 2020
@@ -12,6 +12,11 @@ WARNING: DO NOT edit .pxi FILE directly, .pxi is generated from .pxi.in
from pandas._libs.tslibs.util cimport get_c_string
from pandas._libs.missing cimport C_NA

cdef extern from "Python.h":
# Note: importing extern-style allows us to declare these as nogil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is interacting with Python objects so it needs to be called with the GIL

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will remove it, I am quite new to cython, so I have to read up a bit on these things.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are getting warnings / errors when trying to cythonize you might need a with gil: block in your error handler

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not getting any cython errors, so I am probably good.

@roberthdevries roberthdevries requested a review from WillAyd March 18, 2020 22:51
@@ -192,7 +192,7 @@ def test_categorical_dtype_utf16(all_parsers, csv_dir_path):
pth = os.path.join(csv_dir_path, "utf16_ex.txt")
parser = all_parsers
encoding = "utf-16"
sep = ","
sep = "\t"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm yea very strange. I guess since there isn't a comma in this file at all it just reads every line into a single column....

@jreback jreback added this to the 1.1 milestone Mar 19, 2020
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good. assume this is not perf difference.

@roberthdevries
Copy link
Contributor Author

Another point is that I am not quite sure why the conversion to UTF-8 is needed to compute a hash value. If we can avoid the conversion, the hashing should even be faster.

@WillAyd WillAyd merged commit 5b18d3c into pandas-dev:master Mar 19, 2020
@WillAyd
Copy link
Member

WillAyd commented Mar 19, 2020

Thanks @roberthdevries another great PR

Another point is that I am not quite sure why the conversion to UTF-8 is needed to compute a hash value. If we can avoid the conversion, the hashing should even be faster.

If you see a way to improve performance here would welcome a follow up PR

@roberthdevries roberthdevries deleted the fix-25509-dir-failure-on-df-with-unicode-surrogates branch March 19, 2020 22:02
SeeminSyed pushed a commit to CSCD01-team01/pandas that referenced this pull request Mar 22, 2020
jbrockmendel pushed a commit to jbrockmendel/pandas that referenced this pull request Mar 23, 2020
jbrockmendel pushed a commit to jbrockmendel/pandas that referenced this pull request Mar 25, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Output-Formatting __repr__ of pandas objects, to_string Unicode Unicode strings
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Dir fails on dataframes with pathological column names
3 participants