JSON Date Handling 1.0 Regressions #30977
Conversation
```diff
@@ -456,7 +456,7 @@ static char *PyDateTimeToIso(PyDateTime_Date *obj, NPY_DATETIMEUNIT base,
 static char *PyDateTimeToIsoCallback(JSOBJ obj, JSONTypeContext *tc,
                                      size_t *len) {
-    if (!PyDateTime_Check(obj)) {
+    if (!PyDate_Check(obj)) {
         PyErr_SetString(PyExc_TypeError, "Expected datetime object");
```
Should the error message say "date" instead of "datetime"? And is the call to PyDateTimeToIso below safe if we have a date and not a datetime?
Happy to change; I was thinking about that myself, just didn't want to go down a rabbit hole of cleaning this all up just yet.

W.r.t. your second question, that error check currently doesn't actually raise an error, so I guess it's a no-op (though it probably should do something).
Changed the commented areas to "date" for now; note the second would segfault if you passed it something that wasn't a date object, but I think that's out of scope to fix here.
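For context, a brief sketch of the distinction between the two checks (this is the CPython datetime C-API: datetime.datetime subclasses datetime.date, so a datetime passes both checks while a plain date passes only the broader one):

```c
#include <Python.h>
#include <datetime.h>  /* PyDate_Check / PyDateTime_Check; PyDateTime_IMPORT
                          must be called once before using these macros */

/* True for datetime.date instances AND datetime.datetime instances,
   since datetime subclasses date. */
static int is_date_like(PyObject *obj) {
    return PyDate_Check(obj);
}

/* True only for datetime.datetime instances (and subclasses thereof). */
static int is_datetime(PyObject *obj) {
    return PyDateTime_Check(obj);
}
```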
Does this have similar perf to master?
Looks like a mixed bag:

```
       before           after         ratio
     [d78c0616]       [25444958]
     <master>         <json-date-regr>
+       115±4ms         231±10ms     2.02  io.json.ToJSON.time_to_json('columns', 'df')
+       103±3ms         161±10ms     1.57  io.json.ToJSON.time_to_json('records', 'df')
+       205±10ms        282±20ms     1.37  io.json.ToJSON.time_to_json_wide('index', 'df_int_floats')
-       265±30ms        194±10ms     0.73  io.json.ToJSON.time_to_json_wide('split', 'df_int_float_str')
-       235±10ms        141±9ms      0.60  io.json.ToJSON.time_to_json_wide('records', 'df')
-       276±7ms         141±4ms      0.51  io.json.ToJSON.time_to_json('index', 'df_td_int_ts')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.
```

May want to just couple this with #30903, which has a refactor that helps performance and fixes Timedelta issues, at the cost of a little more code movement.
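For reference, tables like this come from asv's commit-comparison mode; an invocation along these lines (branch name and benchmark filter assumed from the labels above) produces such output:

```
asv continuous -f 1.1 master json-date-regr -b io.json.ToJSON
```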
#30903 fixes #28256 and an unreported memory leak when using datetimelike labels. There is overlap between that and this PR. I consider this the bare minimum to fix the regressions, though I'd have a slight inclination towards doing both this and #30903 (after rebase) to keep performance the same without jumping through a ton of hoops.
Not averse to doing another patch to fix perf, but is there a minimal patch you can add to this one that doesn't involve timedelta (which I view as a nice-to-have)? The datetimes are the regression, yes?
Yeah, can probably extract something out tomorrow.
To keep performance in line I pulled in the encodeLabels refactor from #30903, though without doing any of the Timedelta fixes. To help translate the diff, just note that the code now basically checks up front for a date-like value and converts it to nanosecond resolution (note that NPY_DATETIME includes both numpy datetimes and timedeltas). That nanosecond value gets checked against NaT and then continues on to the appropriate serializer, depending on the type of object supplied and other keyword arguments; see the sketch after the benchmarks.

Here are the benchmarks:

```
       before           after         ratio
     [bc9d329b]       [c102f5b9]
     <master>         <json-date-regr>
-       242±20ms        173±2ms      0.72  io.json.ToJSON.time_to_json_wide('columns', 'df_int_floats')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
```
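A minimal sketch of that control flow, purely illustrative: convert_to_nanoseconds, write_iso8601, and scale_to_unit are hypothetical helper names, not the actual pandas internals; NPY_DATETIME_NAT is numpy's int64 sentinel for NaT.

```c
#include <Python.h>
#include <stdio.h>
#include <numpy/ndarraytypes.h>  /* npy_int64, NPY_DATETIMEUNIT, NPY_DATETIME_NAT */

/* Hypothetical helpers standing in for the real conversion/serialization code. */
extern npy_int64 convert_to_nanoseconds(PyObject *item);
extern void write_iso8601(npy_int64 nanos, char *buf, size_t buflen);
extern npy_int64 scale_to_unit(npy_int64 nanos, NPY_DATETIMEUNIT base);

static void serialize_label(PyObject *item, int datetime_iso,
                            NPY_DATETIMEUNIT base, char *buf, size_t buflen) {
    /* Up-front conversion of any date-like value to nanosecond resolution. */
    npy_int64 nanos = convert_to_nanoseconds(item);

    if (nanos == NPY_DATETIME_NAT) {
        /* NaT serializes as "null", never as the raw int64 sentinel. */
        snprintf(buf, buflen, "null");
    } else if (datetime_iso) {
        write_iso8601(nanos, buf, buflen);  /* ISO-8601 path */
    } else {
        /* Epoch path, scaled to the requested unit. */
        snprintf(buf, buflen, "%" NPY_INT64_FMT, scale_to_unit(nanos, base));
    }
}
```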
```diff
@@ -1615,6 +1641,10 @@ char **NpyArr_encodeLabels(PyArrayObject *labels, PyObjectEncoder *enc,
         ret[i] = PyObject_Malloc(len + 1);
         memcpy(ret[i], cLabel, len + 1);

         if (is_datetimelike) {
```
This was a memory leak before; the new structure makes it easier to clean up.
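For illustration, a minimal sketch of the kind of cleanup this structure enables, assuming (hypothetically) that `ret` holds `count` label buffers allocated with PyObject_Malloc and a later allocation failed mid-loop:

```c
#include <Python.h>
#include <numpy/ndarraytypes.h>  /* npy_intp */

/* Sketch: on error, release every label buffer allocated so far,
   then the array itself, instead of leaking them. */
static void free_labels(char **ret, npy_intp count) {
    for (npy_intp j = 0; j < count; j++) {
        PyObject_Free(ret[j]);  /* matching deallocator for PyObject_Malloc */
    }
    PyObject_Free(ret);
}
```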
```c
        len = strlen(cLabel);
        if (enc->datetimeIso) {
            // TODO: Vectorized Timedelta function
            if ((type_num == NPY_TIMEDELTA) || (PyDelta_Check(item))) {
```
This entire timedelta block was mostly copy/paste, with a few changes at the end to fit the memory assumptions; it will be blown away by #30903.
Thanks @WillAyd
Co-authored-by: William Ayd <[email protected]>
closes #30904
Came across these while adding parametrized tests in #30903, which this pulls some elements from but which will also ultimately refactor to avoid duplication; just getting this to work as is.
To be clear, the regressions were:

- serializing datetime.date objects would raise a TypeError (hence the PyDate_Check change above)
- pd.NaT labels would now yield the sentinel value instead of writing "null"

Both have been fixed here.