
JSON Date Handling 1.0 Regressions #30977


Merged · 7 commits · Jan 18, 2020
Conversation

@WillAyd (Member) commented Jan 13, 2020

closes #30904

Came across these while adding parametrized tests in #30903. This PR pulls some elements from that one, which will ultimately be refactored to avoid duplication; for now, just getting this to work as is.

To be clear, the regressions were:

1. ISO dates with `datetime.date` objects were broken:

```python
>>> import pandas as pd
>>> import datetime
>>> data = [datetime.date(year=2020, month=1, day=1), "a"]
>>> pd.Series(data).to_json(date_format="iso")
TypeError: Expected datetime object
```
2. `pd.NaT` labels would now yield the sentinel value instead of writing "null":

```python
>>> data = [pd.Timestamp("2020-01-01"), pd.NaT]
>>> pd.Series(data, index=data).to_json()
'{"1577836800000":1577836800000,"-9223372036854":null}'
```
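For context on where that broken label comes from: numpy stores NaT as INT64_MIN nanoseconds, and the epoch path scales nanoseconds down to milliseconds with C integer division (which truncates toward zero). A minimal sketch, with illustrative names:

```python
# The broken label "-9223372036854" is numpy's NaT sentinel (INT64_MIN
# nanoseconds) divided down to milliseconds. C's `/` truncates toward zero,
# unlike Python's floor-dividing `//`, which is why the last digits differ
# from a naive Python reproduction.
NAT_NS = -(2**63)  # numpy NaT sentinel: -9223372036854775808 ns

def c_trunc_div(a: int, b: int) -> int:
    """Integer division truncating toward zero, as C's `/` does."""
    q = abs(a) // abs(b)
    return -q if (a < 0) != (b < 0) else q

print(c_trunc_div(NAT_NS, 10**6))  # -9223372036854, the sentinel in the output above
```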

Both have been fixed here.

@pep8speaks commented Jan 13, 2020

Hello @WillAyd! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-01-15 18:26:32 UTC

@WillAyd WillAyd added the IO JSON read_json, to_json, json_normalize label Jan 13, 2020
```diff
@@ -456,7 +456,7 @@ static char *PyDateTimeToIso(PyDateTime_Date *obj, NPY_DATETIMEUNIT base,
 static char *PyDateTimeToIsoCallback(JSOBJ obj, JSONTypeContext *tc,
                                      size_t *len) {

-    if (!PyDateTime_Check(obj)) {
+    if (!PyDate_Check(obj)) {
         PyErr_SetString(PyExc_TypeError, "Expected datetime object");
```
A Member commented on the diff:
should the error message say "date" instead of "datetime"? is the call to PyDateTimeToIso below safe if we have a date and not a datetime?

@WillAyd (Member, Author) replied:

Happy to change; I was thinking about that myself, just didn't want to go down a rabbit hole of cleaning this all up just yet.

W.r.t. your second question, that error check currently doesn't actually raise an error, so I guess it's a no-op (though it probably should do something)

@WillAyd (Member, Author) replied:

Changed the commented areas to "date" for now; note the second would segfault if you passed it something that wasn't a date object, but I think that's out of scope to fix here
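For readers less familiar with the C API change being discussed: `PyDate_Check` accepts both `date` and `datetime` instances, because `datetime.datetime` subclasses `datetime.date`, while `PyDateTime_Check` rejects plain dates. The Python-level equivalent of the check:

```python
import datetime

# datetime.datetime subclasses datetime.date, so a date-based check accepts
# both types -- which is why switching PyDateTime_Check to PyDate_Check lets
# plain dates through while still accepting datetimes.
assert isinstance(datetime.datetime(2020, 1, 13), datetime.date)       # datetimes pass a date check
assert isinstance(datetime.date(2020, 1, 13), datetime.date)           # dates pass too
assert not isinstance(datetime.date(2020, 1, 13), datetime.datetime)   # but dates fail a datetime check
```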

@simonjayhawkins simonjayhawkins added this to the 1.0.0 milestone Jan 13, 2020
@simonjayhawkins simonjayhawkins added the Regression Functionality that used to work in a prior pandas version label Jan 13, 2020
@jreback (Contributor) commented Jan 14, 2020

does this have similar perf to master?

@WillAyd (Member, Author) commented Jan 14, 2020

Looks like a mixed bag:

```
       before           after         ratio
     [d78c0616]       [25444958]
     <master>         <json-date-regr>
+         115±4ms         231±10ms     2.02  io.json.ToJSON.time_to_json('columns', 'df')
+         103±3ms         161±10ms     1.57  io.json.ToJSON.time_to_json('records', 'df')
+        205±10ms         282±20ms     1.37  io.json.ToJSON.time_to_json_wide('index', 'df_int_floats')
-        265±30ms         194±10ms     0.73  io.json.ToJSON.time_to_json_wide('split', 'df_int_float_str')
-        235±10ms          141±9ms     0.60  io.json.ToJSON.time_to_json_wide('records', 'df')
-         276±7ms          141±4ms     0.51  io.json.ToJSON.time_to_json('index', 'df_td_int_ts')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.
```

May want to just couple this with #30903, which has a refactor that helps performance and fixes Timedelta issues, at the cost of a little more code movement

@jbrockmendel (Member) commented:
@WillAyd it looks like between this and #30903 perf is a mixed bag but there are some unambiguous bug fixes?

@WillAyd (Member, Author) commented Jan 15, 2020

#30903 fixes #28256 and an unreported memory leak when using datetimelike labels.

There is overlap between that PR and this one. I consider this the bare minimum to fix the regressions, though I have a slight inclination towards doing both this and #30903 (after rebase) to keep performance the same without jumping through a ton of hoops

@jreback (Contributor) commented Jan 15, 2020

> #30903 fixes #28256 and an unreported memory leak when using datetimelike labels.
>
> There is overlap between that and this PR. I consider this the bare minimum to fix regressions, though would have a slight inclination towards doing this and #30903 (after rebase) to keep performance the same without jumping through a ton of hoops

not averse to doing another patch to fix perf, but is there a minimal patch you can add to this one that doesn't involve timedelta (which i view as a nice-to-have); the datetimes are the regression, yes?

@WillAyd
Copy link
Member Author

WillAyd commented Jan 15, 2020

Yea, can probably extract something out tomorrow

@WillAyd (Member, Author) commented Jan 15, 2020

To keep performance in line, I pulled in the encodeLabels refactor from #30903, though without any of the Timedelta fixes.

To help translate the diff, just note that the code now basically checks up front for a date-like value and converts it to nanosecond resolution (note that NPY_DATETIME includes both numpy datetimes and timedeltas). That nanosecond value gets checked against NaT and then continues on to the appropriate serializer, depending on the type of object supplied and other keyword arguments
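The control flow described above can be sketched in Python. This is a rough illustration of the C logic, not the actual implementation in objToJSON.c; function and parameter names here are made up:

```python
from datetime import datetime, timezone

NAT_NS = -(2**63)  # numpy NaT sentinel in nanoseconds

def encode_label(value_ns: int, iso: bool = False, unit: int = 10**6) -> str:
    """Encode one datetimelike label from its nanosecond resolution.

    Mirrors the described flow: check NaT first, then dispatch to the
    ISO or epoch serializer based on the keyword arguments.
    """
    if value_ns == NAT_NS:
        return "null"  # NaT labels serialize as null, not the raw sentinel
    if iso:
        # Second-resolution ISO string, purely for illustration; the real
        # code preserves sub-second precision.
        dt = datetime.fromtimestamp(value_ns // 10**9, tz=timezone.utc)
        return dt.strftime("%Y-%m-%dT%H:%M:%S")
    return str(value_ns // unit)  # epoch path: scale ns down (ms by default)

print(encode_label(1_577_836_800 * 10**9))            # "1577836800000"
print(encode_label(NAT_NS))                           # "null"
print(encode_label(1_577_836_800 * 10**9, iso=True))  # "2020-01-01T00:00:00"
```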

Here are the benchmarks:

```
       before           after         ratio
     [bc9d329b]       [c102f5b9]
     <master>         <json-date-regr>
-        242±20ms          173±2ms     0.72  io.json.ToJSON.time_to_json_wide('columns', 'df_int_floats')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE INCREASED.
```

```diff
@@ -1615,6 +1641,10 @@ char **NpyArr_encodeLabels(PyArrayObject *labels, PyObjectEncoder *enc,
         ret[i] = PyObject_Malloc(len + 1);
         memcpy(ret[i], cLabel, len + 1);

         if (is_datetimelike) {
```
@WillAyd (Member, Author) commented on the diff:

This was a memory leak before; the new structure makes it easier to clean up

```c
len = strlen(cLabel);
if (enc->datetimeIso) {
    // TODO: Vectorized Timedelta function
    if ((type_num == NPY_TIMEDELTA) || (PyDelta_Check(item))) {
```
@WillAyd (Member, Author) commented on the diff:

This entire timedelta block was mostly copy / paste with a few changes at the end to fit the memory assumptions; this will be blown away by #30903

@jreback jreback merged commit 9c33464 into pandas-dev:master Jan 18, 2020
@jreback (Contributor) commented Jan 18, 2020

thanks @WillAyd

Labels: IO JSON (read_json, to_json, json_normalize); Regression (Functionality that used to work in a prior pandas version)
Successfully merging this pull request may close these issues.

to_json doesn't work with Datetime.date
6 participants