Skip to content

Timedelta in to_json object array and ISO dates not handled properly #28256

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
WillAyd opened this issue Sep 2, 2019 · 7 comments · Fixed by #30903
Closed

Timedelta in to_json object array and ISO dates not handled properly #28256

WillAyd opened this issue Sep 2, 2019 · 7 comments · Fixed by #30903
Labels
IO JSON read_json, to_json, json_normalize Timedelta Timedelta data type
Milestone

Comments

@WillAyd
Copy link
Member

WillAyd commented Sep 2, 2019

Intertwined with #15137 though a slight different issue.

date_format="iso" has different behavior for Timedeltas depending on whether or not the Timedelta is in a DTA or an object array. To illustrate:

# Wrong format ref 15137, but at least tries to do some formatting
>>> pd.DataFrame([[pd.Timedelta("1D")]]).to_json(date_format="iso")
'{"0":{"0":"1970-01-02T00:00:00.000Z"}}'

# Object array has no formatting
>>> pd.DataFrame([[pd.Timedelta("1D")]]).astype(object).to_json(date_format="iso")
'{"0":{"0":86400000}}'

By contrast the same issue does not appear with datetimes

>>> pd.DataFrame([[pd.Timestamp(1)]]).to_json(date_format="iso")
'{"0":{"0":"1970-01-01T00:00:00.000Z"}}'

# Below still formats as iso in spite of being object array
>>> pd.DataFrame([[pd.Timestamp(1)]]).astype(object).to_json(date_format="iso")
'{"0":{"0":"1970-01-01T00:00:00.000Z"}}'
@WillAyd WillAyd added IO JSON read_json, to_json, json_normalize Timedelta Timedelta data type labels Sep 2, 2019
@cbertinato
Copy link
Contributor

What is the desired behavior here? That the first case using Timedelta look like the second (casting to object)?

@WillAyd
Copy link
Member Author

WillAyd commented Sep 5, 2019

The time delta should be written out using iso format just like datetimes are (though using the duration ISO format) whether or not included in an object array or DTA

@cbertinato
Copy link
Contributor

Sounds good. I'll take a whack at it if no one else has professed a desire.

@WillAyd
Copy link
Member Author

WillAyd commented Sep 5, 2019

Awesome! If it's any help I'm fairly certain the issue is that the PyDelta check in the extension module that serializes everything doesn't make consideration as to whether or not the date format should be iso.

Here's where I think this is relevant:

} else if (PyDelta_Check(obj)) {

If you did some kind of conditional branch as follows:

if (enc->datetimeIso) {

The true branch should dispatch to the isoformat method. The existing code can be moved to the false branch

@cbertinato
Copy link
Contributor

Thanks Will! That definitely saved some time getting to the root of one of the issues and I think the fix is just as you say. This takes care of the case when you do astype(object), but without, we still get the timedelta serialized as a datetime.

In this case, we pass through the PyDelta_Check:

} else if (PyDelta_Check(obj)) {

and instead get into the check here:

} else if (PyArray_Check(obj)) {

then here:

int NpyArr_iterNextItem(JSOBJ obj, JSONTypeContext *tc) {

which sets npyType (to 22 for NPY_TIMEDELTA):

((PyObjectEncoder *)tc->encoder)->npyType = PyArray_TYPE(npyarr->array);

then go back to Object_beginTypeContext and we get here:

tc->type = NpyTypeToJSONType(obj, tc, enc->npyType, enc->npyValue);

and finally here:

GET_TC(tc)->PyTypeToJSON = NpyDatetime64ToJSON;

So it appears that a simple check somewhere before that last line to differentiate a timedelta from a datetime could work, but the issue here is that now we're dealing with a NumPy timedelta64 rather than a pandas timedelta so we don't have the isoformat method at our disposal.

This seems to be pointing to a solution where we define something like a NpyTimedelta64ToJSON. I just want to lay all of this out in case there's a simpler and more straightforward solution to this issue that I'm missing.

I suppose another possible solution is to convert the NP timedelta64 to a pandas TimeDelta and then proceed as we do for the other case, but aside for what effects that might have on performance, making such a conversion in C doesn't appear to be as straightforward as it sounds.

@WillAyd
Copy link
Member Author

WillAyd commented Sep 8, 2019

Great thanks for diving into this. For now I think it would be OK to just create a pandas Timedelta object and call isoformat off of that. That would be less performant than a native C routine but simpler at the same time

Generally I think we need to simplify our JSON serialization as it's been rather stale development-wise for a few years, so I've thought about doing as much logic in the Python layer as possible before sending to JSON. This would mean dispatching to vectorized methods in Python where applicable (i.e. a DatetimeArray is converted to an array of ISO strings before getting to the serializer) and for one-off items just converting to the equivalent pandas object during serialization.
We'd be giving up a bit of performance in the object dtype cases but I think simplicity is going to be worth that trade off.

^ just thinking out loud so as you look at it any ideas you have are certainly welcome

@cbertinato
Copy link
Contributor

Thanks Will. That's pretty much how I decided to do it. Looks like that's how it was done in the JSONTableWriter as well. PR coming shortly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO JSON read_json, to_json, json_normalize Timedelta Timedelta data type
Projects
None yet
3 participants