Skip to content

Implement C Level Timedelta ISO Function; fix JSON usage #30903

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 39 commits into from
Mar 19, 2020

Conversation

WillAyd
Copy link
Member

@WillAyd WillAyd commented Jan 10, 2020

similar to @cbertinato great work in #28595 but done deeper in the C extension

@WillAyd WillAyd added the IO JSON read_json, to_json, json_normalize label Jan 10, 2020
@@ -407,6 +406,30 @@ static char *int64ToIso(int64_t value, NPY_DATETIMEUNIT base, size_t *len) {
return result;
}

/* Converts the int64_t representation of a duration to ISO; mutates len */
static char *int64ToIsoDuration(int64_t value, size_t *len) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think after this PR going to split these conversion functions off into a separate module as the JSON one is getting rather big

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The other PR is #31057 should probably be merged first

@@ -1615,6 +1649,11 @@ char **NpyArr_encodeLabels(PyArrayObject *labels, PyObjectEncoder *enc,
ret[i] = PyObject_Malloc(len + 1);
memcpy(ret[i], cLabel, len + 1);

if (is_datetimelike) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This fixes a memory leak in the current implementation where any datetimelike values used in the index would leak their respective character representation out; will add a whatsnew once the file is created

@WillAyd
Copy link
Member Author

WillAyd commented Jan 11, 2020

Current benchmarks:

       before           after         ratio
     [447a3b00]       [4a94f159]
     <master>         <json-timedelta>
+         152±3ms         209±10ms     1.38  io.json.ToJSONISO.time_iso_format('values')
+         168±3ms         213±10ms     1.27  io.json.ToJSONISO.time_iso_format('records')
+         188±4ms         235±10ms     1.25  io.json.ToJSONISO.time_iso_format('split')
+         195±4ms         238±10ms     1.22  io.json.ToJSONISO.time_iso_format('columns')
-        103±10ms         79.2±6ms     0.77  io.json.ToJSON.time_to_json('values', 'df_td_int_ts')
-        137±20ms         95.9±1ms     0.70  io.json.ToJSON.time_to_json('values', 'df_int_float_str')

So the time_iso_format checks are taking a little bit longer but note that they were broken before, i.e. on master you get this:

>>> pd.Series([pd.Timedelta(days=1)], index=[pd.Timedelta(days=1)]).to_json(date_format="iso")
'{"P1DT0H0M0S":"1970-01-02T00:00:00.000Z"}'

Where we are writing as dates and not durations.

I think just a matter of the timedelta -> iso function not being as fast (using sprintf instead of totally hand-coded) but maybe OK for now

@@ -905,3 +905,37 @@ int make_iso_8601_datetime(npy_datetimestruct *dts, char *outstr, int outlen,
outlen);
return -1;
}


int make_iso_8601_timedelta(pandas_timedeltastruct *tds,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that we do also have a isoformat function for time deltas defined here:

def isoformat(self) -> str:

But I think the path to get that to work in the JSON module would be way more complicated than implementing in C as we do here (and less consistent with other datetime handling).

Maybe as a follow up we want to use this C function in the Timedelta class

@jbrockmendel
Copy link
Member

pls rebase

@jbrockmendel
Copy link
Member

LGTM cc @jreback

@jbrockmendel
Copy link
Member

needs rebase, other good to go i think

@jreback jreback added this to the 1.1 milestone Mar 17, 2020
@jreback
Copy link
Contributor

jreback commented Mar 17, 2020

completely closes #28256. ?

can you check the perf again.

@WillAyd
Copy link
Member Author

WillAyd commented Mar 17, 2020

Yea closes; added some more test coverage as well.

Here is an updated benchmark:

       before           after         ratio
     [402c5cdd]       [ef08ad6c]
     <json-test-cleanup~1^2>       <json-timedelta>
+        143±20ms         225±10ms     1.57  io.json.ToJSONISO.time_iso_format('values')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.

@jreback
Copy link
Contributor

jreback commented Mar 17, 2020

this is slower?

@WillAyd
Copy link
Member Author

WillAyd commented Mar 17, 2020

Slower but correct

@jbrockmendel
Copy link
Member

Slower but correct

worth the tradeoff every time.

@jreback jreback merged commit 80078ac into pandas-dev:master Mar 19, 2020
@jreback
Copy link
Contributor

jreback commented Mar 19, 2020

thanks, correct wins every time.

SeeminSyed pushed a commit to CSCD01-team01/pandas that referenced this pull request Mar 22, 2020
jbrockmendel pushed a commit to jbrockmendel/pandas that referenced this pull request Mar 23, 2020
jbrockmendel pushed a commit to jbrockmendel/pandas that referenced this pull request Mar 25, 2020
@WillAyd WillAyd deleted the json-timedelta branch April 12, 2023 20:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO JSON read_json, to_json, json_normalize
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Timedelta in to_json object array and ISO dates not handled properly
4 participants