-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: passing non-printable unicode to datetime parsing functions #30374
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@jbrockmendel need to run black |
Looks like there is a Py_UNICODE_ISPRINTABLE that we might able to use; looks like cython doesnt make that optimization on its own |
can you just confirm that there isn't a perf hit (and/or use the cython optimization) |
first attempt to use it gave me compile-time troubles. will try to figure that out, but also hoping we dont need it. There are some places where we call this possibly-dangerous function inside a loop, but those are excluded from this PR so we should be OK. Am running asvs now anyway to be on the safe side |
These are the only functions which are directly affected by the new check, all within the margin of error. |
Do you know where this is segfaulting? Just wonder if the parsing functions shouldn’t more gracefully fail here instead of having to explicitly check for this |
It's the same underlying issue as #25509 when we call tslibs.util.get_c_string or get_c_string_and_whatever. It's any string
Not sure I understand. The way it fails in this PR seems like a lot more graceful than a segfault. |
I think the segfault is actually that https://docs.python.org/3/c-api/unicode.html#c.PyUnicode_AsUTF8AndSize but the functions that call it aren't checking for the NULL case. IIUC should be checking for that and propagating the error rather than the |
That was my read of the docs too. So I tried adding a check for NULL
bit I think it still segfaulted on the first call. It's totally plausible that I got the semantics wrong |
Hmm OK there is something weird here. To reproduce I created this extension module: #define PY_SSIZE_T_CLEAN
#include <Python.h>
// should error for bad input (not segfault)
static PyObject *error_if_not_utf8(PyObject *dummy, PyObject *args) {
Py_ssize_t size;
PyObject *str;
int ok = PyArg_ParseTuple(args, "O", &str);
if (!ok)
return NULL;
const char* result = PyUnicode_AsUTF8AndSize(str, &size);
if (result == NULL)
return NULL;
return str;
}
static PyMethodDef TestMethods[] = {
{"error_if_not_utf8", error_if_not_utf8, METH_VARARGS,
"Check if surrogates cause a segfault."},
{NULL, NULL, 0, NULL}};
static struct PyModuleDef testmodule = {PyModuleDef_HEAD_INIT,
"libtest", // Name of module
NULL, // Documentation
-1, // keep state in global variables
TestMethods};
PyMODINIT_FUNC PyInit_libtest(void) {
return PyModule_Create(&testmodule);
} compiled using this setup file: from setuptools import Extension, setup
test_module = Extension(
"libtest",
sources=["testmodule.c",]
)
setup(
name="test",
ext_modules=[test_module],
) And executed this code in the REPL:
Which segfaulted on the second call with the invalid surrogate I opened https://bugs.python.org/issue39113?@ok_message=file%2048798%20created%0Amsg%20358755%20created%0Aissue%2039113%20created&@template=item so maybe see if we can get some clarification there as well I'm just slightly wary of changing things on our end for this as this is a rather esoteric thing to face in the first place, but no problem if others disagree |
Definitely weird, definitely worth being extra-careful about. Most of what is in this PR ended up being only tangentially related to the segfault, so ill put those cleanups in a separate PR so this can have a tight focus. |
I'm going to close this PR and do the non-segfault-adjacent parts in a separate PR. The rest of the discussion can move back to #25509 and the BPO issue |
ATM these segfault (on OSX at least). There are more of these that I'll get in upcoming passes (I think no more than 2 more will be needed), keeping this one small to make the issue obvious and improve some of the typing in functions that will be checked in the next passes.
Needs double-checking: is the
val.isprintable()
check equivalent to "PyUnicode_AsUTF8AndSize(val, length)
wont segfault"? or is that just an approximation?