Skip to content

BUG: passing non-printable unicode to datetime parsing functions #30374

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 3 commits into from

Conversation

jbrockmendel
Copy link
Member

ATM these segfault (on OSX at least). There are more of these that I'll get in upcoming passes (I think no more than 2 more will be needed), keeping this one small to make the issue obvious and improve some of the typing in functions that will be checked in the next passes.

Needs double-checking: is the val.isprintable() check equivalent to "PyUnicode_AsUTF8AndSize(val, length) wont segfault"? or is that just an approximation?

@simonjayhawkins simonjayhawkins added Bug Timezones Timezone data dtype labels Dec 20, 2019
@simonjayhawkins simonjayhawkins added this to the 1.0 milestone Dec 20, 2019
@simonjayhawkins
Copy link
Member

@jbrockmendel need to run black

@jbrockmendel
Copy link
Member Author

Looks like there is a Py_UNICODE_ISPRINTABLE that we might able to use; looks like cython doesnt make that optimization on its own

@jreback
Copy link
Contributor

jreback commented Dec 20, 2019

can you just confirm that there isn't a perf hit (and/or use the cython optimization)

@jbrockmendel
Copy link
Member Author

and/or use the cython optimization)

first attempt to use it gave me compile-time troubles. will try to figure that out, but also hoping we dont need it.

There are some places where we call this possibly-dangerous function inside a loop, but those are excluded from this PR so we should be OK. Am running asvs now anyway to be on the safe side

@jbrockmendel
Copy link
Member Author

In [1]: from pandas._libs.tslibs.parsing import parse_datetime_string, parse_time_string                                                                                                                        
In [2]: %timeit parse_time_string("2016-01-02 03:04:05")                                                                                                                                                        
92.3 µs ± 2.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)    # <-- master
90.8 µs ± 2.96 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)  # <-- PR

In [3] %timeit parse_datetime_string("2016-01-02 03:04:05")                                                                                                                                                   
76.9 µs ± 1.14 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)  # <-- master
77.6 µs ± 1.28 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)  # <-- PR

These are the only functions which are directly affected by the new check, all within the margin of error.

@WillAyd
Copy link
Member

WillAyd commented Dec 21, 2019

Do you know where this is segfaulting? Just wonder if the parsing functions shouldn’t more gracefully fail here instead of having to explicitly check for this

@jbrockmendel
Copy link
Member Author

Do you know where this is segfaulting?

It's the same underlying issue as #25509 when we call tslibs.util.get_c_string or get_c_string_and_whatever. It's any string py_string such that py_string.encode("UTF-8") raises. IIUC the py_string.isprintable() check is a standing in for that can-we-encode check.

Just wonder if the parsing functions shouldn’t more gracefully fail here instead of having to explicitly check for this

Not sure I understand. The way it fails in this PR seems like a lot more graceful than a segfault.

@WillAyd
Copy link
Member

WillAyd commented Dec 21, 2019

I think the segfault is actually that get_c_string_buf_and_size returns NULL on error and sets an exception:

https://docs.python.org/3/c-api/unicode.html#c.PyUnicode_AsUTF8AndSize

but the functions that call it aren't checking for the NULL case. IIUC should be checking for that and propagating the error rather than the is_printable check being added here

@jbrockmendel
Copy link
Member Author

I think the segfault is actually that get_c_string_buf_and_size returns NULL on error and sets an exception:

That was my read of the docs too. So I tried adding a check for NULL

buf = PyUnicode_AsUTF8AndSize(py_string, length)
if buf is NULL:
    buf = PyUnicode_AsUTF8AndSize(repr(py_string), length)

bit I think it still segfaulted on the first call. It's totally plausible that I got the semantics wrong

@WillAyd
Copy link
Member

WillAyd commented Dec 21, 2019

Hmm OK there is something weird here. To reproduce I created this extension module:

#define PY_SSIZE_T_CLEAN
#include <Python.h>


// should error for bad input (not segfault)
static PyObject *error_if_not_utf8(PyObject *dummy, PyObject *args) {
  Py_ssize_t size;
  PyObject *str;
  
  int ok = PyArg_ParseTuple(args, "O", &str);
  if (!ok)
    return NULL;
  
  const char* result = PyUnicode_AsUTF8AndSize(str, &size);
  if (result == NULL)
    return NULL;

  return str;
}

static PyMethodDef TestMethods[] = {
    {"error_if_not_utf8", error_if_not_utf8, METH_VARARGS,
     "Check if surrogates cause a segfault."},
    {NULL, NULL, 0, NULL}};

static struct PyModuleDef testmodule = {PyModuleDef_HEAD_INIT,
                                          "libtest", // Name of module
                                          NULL,        // Documentation
                                          -1, // keep state in global variables
                                          TestMethods};

PyMODINIT_FUNC PyInit_libtest(void) {
    return PyModule_Create(&testmodule);
}

compiled using this setup file:

from setuptools import Extension, setup

test_module = Extension(
    "libtest",
    sources=["testmodule.c",]
)


setup(
    name="test",
    ext_modules=[test_module],
)

And executed this code in the REPL:

import libtest

libtest.error_if_not_utf8("foo")
libtest.error_if_not_utf8("\ud83d")
libtest.error_if_not_utf8("foo")
libtest.error_if_not_utf8("\ud83d")

Which segfaulted on the second call with the invalid surrogate

I opened https://bugs.python.org/issue39113?@ok_message=file%2048798%20created%0Amsg%20358755%20created%0Aissue%2039113%20created&@template=item so maybe see if we can get some clarification there as well

I'm just slightly wary of changing things on our end for this as this is a rather esoteric thing to face in the first place, but no problem if others disagree

@jbrockmendel
Copy link
Member Author

Definitely weird, definitely worth being extra-careful about. Most of what is in this PR ended up being only tangentially related to the segfault, so ill put those cleanups in a separate PR so this can have a tight focus.

@jbrockmendel
Copy link
Member Author

I'm going to close this PR and do the non-segfault-adjacent parts in a separate PR. The rest of the discussion can move back to #25509 and the BPO issue

@jbrockmendel jbrockmendel mentioned this pull request Dec 21, 2019
1 task
@jbrockmendel jbrockmendel deleted the bug-get_c_string branch April 5, 2020 17:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Timezones Timezone data dtype
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants