Skip to content

BUG: to_json incorrectly localizes tz-naive datetimes to UTC #46730

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 13 commits into from
May 4, 2022

Conversation

lithomas1
Copy link
Member

@lithomas1 lithomas1 added Bug IO JSON read_json, to_json, json_normalize labels Apr 10, 2022
Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow great start. A few things I think we can improve upon but this looks great

@@ -221,8 +221,18 @@ static PyObject *get_values(PyObject *obj) {
// The special cases to worry about are dt64tz and category[dt64tz].
// In both cases we want the UTC-localized datetime64 ndarray,
// without going through and object array of Timestamps.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@WillAyd In my current (very rough) implementation of this, I've decided to force an array of Timestamps(not sure what the perf/behavior implications of this would be).

Alternatively, I could try to just use the dt64 ndarray, and grab the tz of the DTA(as the tz should be the same). Then, I can probably store this inside the JSONTypeContext and pass that through to int64ToISO function. This muddles up the code a little, since numpy dt64 objects are supposed to be tz-naive and PyDateTimetoISO should really be handling this case.

Do you have a preference for one way or the other?
(also cc @jbrockmendel who may have an opinion)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would avoid adding anything to JSONTypeContext as it’s already way too overloaded. Is this not something we can just manage in the serializer?

Working with an ndarray should be much faster. However, there are also cases where we need to handle datetimes being inside of an object array so need to be robust in how we manage (you might want to add a test for the latter)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added the test. Can you elaborate more about how to handle this in the serializer?
I don't see anywhere else where I could store the tz info other than in the TypeContext.

I guess the way it is handled now is fine, but it would be nice to avoid the performance hit from casting to object.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might get away with storing this information in the NpyArrayContext since that already has the type number from NumPy. My hesitation with TypeContext is that it is already pretty sparsely overloaded for different types so its difficult to debug and figure out the intent of it within various contexts

@lithomas1 lithomas1 marked this pull request as ready for review April 16, 2022 16:28
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

needs a big whatsnew note

# force dt64tz to go through object dtype
# tz info will be lost when converting to
# dt64 which is naive
return self.values.astype(object)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with this change, behavior is equivalent to just removing both this and DatimeLikeBlock.values_for_json. Might mean a perf hit for dt64naive and td64 though

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I'm not sure I follow here about dt64naive and td64.
Don't those go through DatetimeLikeBlock instead of DatetimeTZBlock? (I didn't change any behavior there).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right. what im saying is that i think NDArrayBackedBlock.values_for_json works, so no longer needs to be overridden

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll take care of this in a follow up.

@jreback jreback added this to the 1.5 milestone Apr 26, 2022
@lithomas1
Copy link
Member Author

Sorry for the low quality code. I was prototyping quickly and forgot to clean up afterwards.

Regarding #46730, I think I'll stick to the current approach for now. Existing perf benchmarks surprisingly seem unaffected by this change, so not sure its worth pursuing at least for now(it might be worth doing for the tz-aware case later).

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one last nit otherwise this lgtm

if (PyObject_HasAttrString(obj, "tzinfo")) {
PyObject *offset = extract_utc_offset(obj);
if (offset == NULL) {
return NULL;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we also need to PyObject_Free(result) before returning if things fail to avoid any leaks

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lgtm - impressive work

@jreback jreback merged commit 009c4c6 into pandas-dev:main May 4, 2022
@jreback
Copy link
Contributor

jreback commented May 4, 2022

thanks @lithomas1 very nice!

I believe there is a de-duplication followup (values_for_json) if you can

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO JSON read_json, to_json, json_normalize
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pandas.Series.to_json() incorrectly localizes tz-naive datetimes to UTC
4 participants