BUG: Fix segfault on dir of a DataFrame with a unicode surrogate character in the column name #32701

roberthdevries · 2020-03-14T15:13:37Z

Return a repr() version if the column name string is not printable. This also means that the column name is not present in the output of dir().

closes Dir fails on dataframes with pathological column names #25509
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

WillAyd · 2020-03-14T16:47:39Z

Thanks for the PR! I'm not sure falling back to the repr is the right approach though. A call to print would raise:

>>> colname = "\ud83d"
>>> print(colname)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed

So I think we should do the same here.

roberthdevries · 2020-03-14T17:40:00Z

I think this column name should be treated like other non-identifier column names. So this means that it gets ignored in dir(), but you can see it when you print the data frame.
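
To illustrate the behaviour described above, here is a minimal sketch in plain Python. The helper name `dir_candidates` is hypothetical and is not the pandas implementation; it only shows that a lone surrogate is neither printable nor a valid identifier, so an identifier-based filter would leave it out of a dir()-style listing.

```python
def dir_candidates(column_names):
    """Hypothetical helper: keep only names usable as attribute names."""
    return [
        name
        for name in column_names
        if isinstance(name, str) and name.isidentifier()
    ]


# "\ud83d" is a lone UTF-16 surrogate; it is not printable and not an identifier.
columns = ["price", "unit cost", "\ud83d"]
visible = dir_candidates(columns)
surrogate_is_printable = "\ud83d".isprintable()
```

Under these assumptions only `"price"` survives the filter, while the other two names would still appear when the frame itself is displayed.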

jreback

can you run benchmarks and see if anything changes? this call likely impacts most of the asv's

jreback · 2020-03-14T19:34:19Z

pandas/_libs/tslibs/util.pxd

-
-    buf = PyUnicode_AsUTF8AndSize(py_string, length)
-    return buf
+    if not py_string.isprintable():


shouldn't we be calling the c-func here? this is a heavily used path

Which function did you have in mind? The py_string is not something you can give to a C standard library function AFAIK. Plus this gets translated to C code calling python functions anyway.

can you show the before & after generated c-code here.

No longer relevant, I changed the code so that this is no longer there.

roberthdevries · 2020-03-15T09:56:33Z

I have now pushed the check for printable characters up in the failing call chain. This is the only fix that I could think of which did not print any error messages.

I have also made a mini bug reproducing script that just calls the underlying failing pandas call:

import pandas as pd
import numpy as np

a = np.ndarray(shape=(1,), dtype=object)  # np.object was removed in NumPy 1.24
a.fill('\ud83d')
table = pd._libs.hashtable.StringHashTable(2)
print(table.unique(a))

jreback · 2020-03-16T01:38:57Z

pandas/tests/io/parser/test_dtypes.py

@@ -192,7 +192,7 @@ def test_categorical_dtype_utf16(all_parsers, csv_dir_path):
    pth = os.path.join(csv_dir_path, "utf16_ex.txt")
    parser = all_parsers
    encoding = "utf-16"
-    sep = ","
+    sep = "\t"


why did this change?

This test failed for some reason. When I investigated, I noticed that the separator in this file is a tab rather than a comma, so I changed it and the test passed. I am not sure what caused the failure, but I think the test was wrong to begin with and some weird feature interaction exposed it.

Hmm yea very strange. I guess since there isn't a comma in this file at all it just reads every line into a single column....
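
The single-column effect is easy to reproduce with the standard library. This is only a sketch using `csv.reader` on made-up tab-separated data, not the pandas parser or the actual `utf16_ex.txt` file:

```python
import csv
import io

# Made-up tab-separated sample standing in for the real test file.
data = "a\tb\n1\t2\n"


def parse(text, sep):
    """Parse text into rows using the given single-character delimiter."""
    return list(csv.reader(io.StringIO(text), delimiter=sep))


rows_comma = parse(data, ",")   # wrong separator: each line stays one field
rows_tab = parse(data, "\t")    # correct separator: two fields per line
```

With the wrong separator every row has exactly one field that still contains the tab character, which matches the guess above.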

jreback · 2020-03-16T01:39:54Z

can you run the algos asv's (e.g. the things that hit unique) and report the results.

roberthdevries · 2020-03-17T20:36:04Z

You lost me here, what are the algos asv's?

WillAyd · 2020-03-18T00:04:31Z

Would be in regards to this:

https://pandas.pydata.org/pandas-docs/stable/development/contributing.html#running-the-performance-test-suite

so from asv_bench folder run

asv continuous upstream/master HEAD -b algorithms

roberthdevries · 2020-03-18T20:22:12Z

The results show a significant performance decrease for the string cases (a factor of 1.4 to 2.3 worse).
I guess I have to figure out how to reset the Python error when the conversion fails.
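
As a pure-Python analogy of "reset the error and fall back", the sketch below tries the UTF-8 conversion first and only pays for the fallback when it fails. The function name `utf8_or_repr` is hypothetical; the real change lives in Cython and may differ in detail.

```python
def utf8_or_repr(value: str) -> bytes:
    """Hypothetical helper: encode to UTF-8, falling back to repr() on failure."""
    try:
        # Fast path: valid strings are encoded directly, with no extra check.
        return value.encode("utf-8")
    except UnicodeEncodeError:
        # The exception is handled here, so no error state leaks to the caller.
        # repr() escapes the surrogate, giving an ASCII-only string.
        return repr(value).encode("utf-8")


plain = utf8_or_repr("abc")
fallback = utf8_or_repr("\ud83d")
```

Keeping the check out of the common path is the design point: ordinary strings take the same route as before, and only strings that cannot be encoded hit the slower branch.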

roberthdevries · 2020-03-18T20:48:49Z

With the current patch:

       before           after         ratio
     [e72e2dd1]       [a52e59cd]
     <master>         <fix-25509-dir-failure-on-df-with-unicode-surrogates>
-      5.83±0.8μs      4.84±0.02μs     0.83  algorithms.Duplicated.time_duplicated(True, 'first', 'string')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.

WillAyd · 2020-03-18T22:06:10Z

pandas/_libs/hashtable_class_helper.pxi.in

@@ -12,6 +12,11 @@ WARNING: DO NOT edit .pxi FILE directly, .pxi is generated from .pxi.in
 from pandas._libs.tslibs.util cimport get_c_string
 from pandas._libs.missing cimport C_NA

+cdef extern from "Python.h":
+    # Note: importing extern-style allows us to declare these as nogil


This is interacting with Python objects so it needs to be called with the GIL

Will remove it, I am quite new to cython, so I have to read up a bit on these things.

If you are getting warnings / errors when trying to cythonize you might need a with gil: block in your error handler

I am not getting any cython errors, so I am probably good.

WillAyd · 2020-03-18T23:42:17Z

pandas/tests/io/parser/test_dtypes.py

@@ -192,7 +192,7 @@ def test_categorical_dtype_utf16(all_parsers, csv_dir_path):
    pth = os.path.join(csv_dir_path, "utf16_ex.txt")
    parser = all_parsers
    encoding = "utf-16"
-    sep = ","
+    sep = "\t"


Hmm yea very strange. I guess since there isn't a comma in this file at all it just reads every line into a single column....

jreback

looks good. I assume there is no perf difference.

roberthdevries · 2020-03-19T19:46:52Z

Another point is that I am not quite sure why the conversion to UTF-8 is needed to compute a hash value. If we can avoid the conversion, the hashing should even be faster.
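
For comparison, Python's built-in hashing of `str` objects does not go through a UTF-8 conversion, so a lone surrogate can be used as a dict or set key without error. The sketch below only demonstrates that fact in plain Python; whether the pandas hash tables could skip the conversion is a separate question that this example does not answer.

```python
# De-duplicate while preserving order, using the built-in str hash.
values = ["\ud83d", "a", "\ud83d", "b", "a"]
unique_values = list(dict.fromkeys(values))

# Hashing the surrogate string directly does not raise.
surrogate_hash_is_int = isinstance(hash("\ud83d"), int)
```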

WillAyd · 2020-03-19T19:51:11Z

Thanks @roberthdevries another great PR

Another point is that I am not quite sure why the conversion to UTF-8 is needed to compute a hash value. If we can avoid the conversion, the hashing should even be faster.

If you see a way to improve performance here, we would welcome a follow-up PR.

roberthdevries changed the title ~~Fix segfault on dir of a DataFrame with an unicode surrogate character in the column name~~ Fix segfault on dir of a DataFrame with a unicode surrogate character in the column name Mar 14, 2020

jreback requested changes Mar 14, 2020

jreback added the Output-Formatting and Unicode labels Mar 14, 2020

roberthdevries force-pushed the fix-25509-dir-failure-on-df-with-unicode-surrogates branch from 37bbffa to 6ec08ac Compare March 14, 2020 22:28

roberthdevries requested a review from jreback March 14, 2020 22:44

roberthdevries force-pushed the fix-25509-dir-failure-on-df-with-unicode-surrogates branch from 6ec08ac to bb77b0b Compare March 15, 2020 07:22

jreback reviewed Mar 16, 2020

roberthdevries requested a review from jreback March 17, 2020 20:34

roberthdevries added 5 commits March 18, 2020 21:23

Fix segfault on dir of a DataFrame with an unicode surrogate characte…

bf5d59e

…r in the column name Return a repr() version if a string is not printable

Fix failing test, the separator in the data is actually a tab, not a …

611cca9

…comma

Pushed the fix a bit higher up in the call chain to prevent unneeded …

6a3e986

…impact

Fix linting error

27da130

Performance fix

a52e59c

roberthdevries force-pushed the fix-25509-dir-failure-on-df-with-unicode-surrogates branch from 7447607 to a52e59c Compare March 18, 2020 20:47

roberthdevries changed the title ~~Fix segfault on dir of a DataFrame with a unicode surrogate character in the column name~~ BUG: Fix segfault on dir of a DataFrame with a unicode surrogate character in the column name Mar 18, 2020

WillAyd requested changes Mar 18, 2020

Remove nogil on PyErr_Clear()

b2593bd

roberthdevries requested a review from WillAyd March 18, 2020 22:51

WillAyd requested changes Mar 18, 2020

jreback added this to the 1.1 milestone Mar 19, 2020

jreback approved these changes Mar 19, 2020

WillAyd approved these changes Mar 19, 2020

WillAyd merged commit 5b18d3c into pandas-dev:master Mar 19, 2020

roberthdevries deleted the fix-25509-dir-failure-on-df-with-unicode-surrogates branch March 19, 2020 22:02

SeeminSyed pushed a commit to CSCD01-team01/pandas that referenced this pull request Mar 22, 2020

BUG: Fix segfault on dir of a DataFrame with a unicode surrogate char…

1a29e2b

…acter in the column name (pandas-dev#32701)

jbrockmendel pushed a commit to jbrockmendel/pandas that referenced this pull request Mar 23, 2020

BUG: Fix segfault on dir of a DataFrame with a unicode surrogate char…

1366dbf

…acter in the column name (pandas-dev#32701)

jbrockmendel mentioned this pull request Mar 23, 2020

CI: Exception ignored in: 'pandas._libs.tslibs.util.get_c_string_buf_and_size' #32951

Closed

jbrockmendel pushed a commit to jbrockmendel/pandas that referenced this pull request Mar 25, 2020

BUG: Fix segfault on dir of a DataFrame with a unicode surrogate char…

1db3b09

…acter in the column name (pandas-dev#32701)

jorisvandenbossche mentioned this pull request Jun 4, 2020

BUG: Series.unique segfaults on invalid unicode #34550

Closed
