BUG: passing non-printable unicode to datetime parsing functions #30374

jbrockmendel · 2019-12-20T05:33:14Z

ATM these segfault (on OSX at least). There are more of these that I'll get in upcoming passes (I think no more than 2 more will be needed), keeping this one small to make the issue obvious and improve some of the typing in functions that will be checked in the next passes.

Needs double-checking: is the val.isprintable() check equivalent to "PyUnicode_AsUTF8AndSize(val, length) wont segfault"? or is that just an approximation?

simonjayhawkins · 2019-12-20T10:29:55Z

@jbrockmendel need to run black

…g-get_c_string

jbrockmendel · 2019-12-20T16:26:35Z

Looks like there is a Py_UNICODE_ISPRINTABLE that we might able to use; looks like cython doesnt make that optimization on its own

jreback · 2019-12-20T17:59:29Z

can you just confirm that there isn't a perf hit (and/or use the cython optimization)

jbrockmendel · 2019-12-20T20:01:14Z

and/or use the cython optimization)

first attempt to use it gave me compile-time troubles. will try to figure that out, but also hoping we dont need it.

There are some places where we call this possibly-dangerous function inside a loop, but those are excluded from this PR so we should be OK. Am running asvs now anyway to be on the safe side

jbrockmendel · 2019-12-20T21:29:25Z

In [1]: from pandas._libs.tslibs.parsing import parse_datetime_string, parse_time_string                                                                                                                        
In [2]: %timeit parse_time_string("2016-01-02 03:04:05")                                                                                                                                                        
92.3 µs ± 2.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)    # <-- master
90.8 µs ± 2.96 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)  # <-- PR

In [3] %timeit parse_datetime_string("2016-01-02 03:04:05")                                                                                                                                                   
76.9 µs ± 1.14 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)  # <-- master
77.6 µs ± 1.28 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)  # <-- PR

These are the only functions which are directly affected by the new check, all within the margin of error.

WillAyd · 2019-12-21T02:20:54Z

Do you know where this is segfaulting? Just wonder if the parsing functions shouldn’t more gracefully fail here instead of having to explicitly check for this

jbrockmendel · 2019-12-21T02:34:50Z

Do you know where this is segfaulting?

It's the same underlying issue as #25509 when we call tslibs.util.get_c_string or get_c_string_and_whatever. It's any string py_string such that py_string.encode("UTF-8") raises. IIUC the py_string.isprintable() check is a standing in for that can-we-encode check.

Just wonder if the parsing functions shouldn’t more gracefully fail here instead of having to explicitly check for this

Not sure I understand. The way it fails in this PR seems like a lot more graceful than a segfault.

WillAyd · 2019-12-21T02:59:20Z

I think the segfault is actually that get_c_string_buf_and_size returns NULL on error and sets an exception:

https://docs.python.org/3/c-api/unicode.html#c.PyUnicode_AsUTF8AndSize

but the functions that call it aren't checking for the NULL case. IIUC should be checking for that and propagating the error rather than the is_printable check being added here

jbrockmendel · 2019-12-21T03:06:38Z

I think the segfault is actually that get_c_string_buf_and_size returns NULL on error and sets an exception:

That was my read of the docs too. So I tried adding a check for NULL

buf = PyUnicode_AsUTF8AndSize(py_string, length)
if buf is NULL:
    buf = PyUnicode_AsUTF8AndSize(repr(py_string), length)

bit I think it still segfaulted on the first call. It's totally plausible that I got the semantics wrong

WillAyd · 2019-12-21T03:37:44Z

Hmm OK there is something weird here. To reproduce I created this extension module:

#define PY_SSIZE_T_CLEAN
#include <Python.h>


// should error for bad input (not segfault)
static PyObject *error_if_not_utf8(PyObject *dummy, PyObject *args) {
  Py_ssize_t size;
  PyObject *str;
  
  int ok = PyArg_ParseTuple(args, "O", &str);
  if (!ok)
    return NULL;
  
  const char* result = PyUnicode_AsUTF8AndSize(str, &size);
  if (result == NULL)
    return NULL;

  return str;
}

static PyMethodDef TestMethods[] = {
    {"error_if_not_utf8", error_if_not_utf8, METH_VARARGS,
     "Check if surrogates cause a segfault."},
    {NULL, NULL, 0, NULL}};

static struct PyModuleDef testmodule = {PyModuleDef_HEAD_INIT,
                                          "libtest", // Name of module
                                          NULL,        // Documentation
                                          -1, // keep state in global variables
                                          TestMethods};

PyMODINIT_FUNC PyInit_libtest(void) {
    return PyModule_Create(&testmodule);
}

compiled using this setup file:

from setuptools import Extension, setup

test_module = Extension(
    "libtest",
    sources=["testmodule.c",]
)


setup(
    name="test",
    ext_modules=[test_module],
)

And executed this code in the REPL:

import libtest

libtest.error_if_not_utf8("foo")
libtest.error_if_not_utf8("\ud83d")
libtest.error_if_not_utf8("foo")
libtest.error_if_not_utf8("\ud83d")

Which segfaulted on the second call with the invalid surrogate

I opened https://bugs.python.org/issue39113?@ok_message=file%2048798%20created%0Amsg%20358755%20created%0Aissue%2039113%20created&@template=item so maybe see if we can get some clarification there as well

I'm just slightly wary of changing things on our end for this as this is a rather esoteric thing to face in the first place, but no problem if others disagree

jbrockmendel · 2019-12-21T03:47:24Z

Definitely weird, definitely worth being extra-careful about. Most of what is in this PR ended up being only tangentially related to the segfault, so ill put those cleanups in a separate PR so this can have a tight focus.

jbrockmendel · 2019-12-21T20:10:22Z

I'm going to close this PR and do the non-segfault-adjacent parts in a separate PR. The rest of the discussion can move back to #25509 and the BPO issue

BUG: passing non-printable unicode to datetime parsing functions

63cab41

simonjayhawkins added Bug Timezones Timezone data dtype labels Dec 20, 2019

simonjayhawkins added this to the 1.0 milestone Dec 20, 2019

Merge branch 'master' of https://github.com/pandas-dev/pandas into bu…

ebaabe9

…g-get_c_string

blackify

cbe63ab

jbrockmendel closed this Dec 21, 2019

jbrockmendel mentioned this pull request Dec 21, 2019

CLN: tslibs.parsing #30394

Merged

1 task

jbrockmendel deleted the bug-get_c_string branch April 5, 2020 17:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BUG: passing non-printable unicode to datetime parsing functions #30374

BUG: passing non-printable unicode to datetime parsing functions #30374

jbrockmendel commented Dec 20, 2019

simonjayhawkins commented Dec 20, 2019

jbrockmendel commented Dec 20, 2019

jreback commented Dec 20, 2019

jbrockmendel commented Dec 20, 2019

jbrockmendel commented Dec 20, 2019

WillAyd commented Dec 21, 2019

jbrockmendel commented Dec 21, 2019

WillAyd commented Dec 21, 2019

jbrockmendel commented Dec 21, 2019

WillAyd commented Dec 21, 2019

jbrockmendel commented Dec 21, 2019

jbrockmendel commented Dec 21, 2019

BUG: passing non-printable unicode to datetime parsing functions #30374

BUG: passing non-printable unicode to datetime parsing functions #30374

Conversation

jbrockmendel commented Dec 20, 2019

simonjayhawkins commented Dec 20, 2019

jbrockmendel commented Dec 20, 2019

jreback commented Dec 20, 2019

jbrockmendel commented Dec 20, 2019

jbrockmendel commented Dec 20, 2019

WillAyd commented Dec 21, 2019

jbrockmendel commented Dec 21, 2019

WillAyd commented Dec 21, 2019

jbrockmendel commented Dec 21, 2019

WillAyd commented Dec 21, 2019

jbrockmendel commented Dec 21, 2019

jbrockmendel commented Dec 21, 2019