-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: to_json incorrectly localizes tz-naive datetimes to UTC #46730
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow great start. A few things I think we can improve upon but this looks great
@@ -221,8 +221,18 @@ static PyObject *get_values(PyObject *obj) { | |||
// The special cases to worry about are dt64tz and category[dt64tz]. | |||
// In both cases we want the UTC-localized datetime64 ndarray, | |||
// without going through and object array of Timestamps. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@WillAyd In my current (very rough) implementation of this, I've decided to force an array of Timestamps(not sure what the perf/behavior implications of this would be).
Alternatively, I could try to just use the dt64 ndarray, and grab the tz of the DTA(as the tz should be the same). Then, I can probably store this inside the JSONTypeContext and pass that through to int64ToISO function. This muddles up the code a little, since numpy dt64 objects are supposed to be tz-naive and PyDateTimetoISO should really be handling this case.
Do you have a preference for one way or the other?
(also cc @jbrockmendel who may have an opinion)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would avoid adding anything to JSONTypeContext as it’s already way too overloaded. Is this not something we can just manage in the serializer?
Working with an ndarray should be much faster. However, there are also cases where we need to handle datetimes being inside of an object array so need to be robust in how we manage (you might want to add a test for the latter)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the test. Can you elaborate more about how to handle this in the serializer?
I don't see anywhere else where I could store the tz info other than in the TypeContext.
I guess the way it is handled now is fine, but it would be nice to avoid the performance hit from casting to object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might get away with storing this information in the NpyArrayContext
since that already has the type number from NumPy. My hesitation with TypeContext is that it is already pretty sparsely overloaded for different types so its difficult to debug and figure out the intent of it within various contexts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
needs a big whatsnew note
# force dt64tz to go through object dtype | ||
# tz info will be lost when converting to | ||
# dt64 which is naive | ||
return self.values.astype(object) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
with this change, behavior is equivalent to just removing both this and DatimeLikeBlock.values_for_json. Might mean a perf hit for dt64naive and td64 though
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I'm not sure I follow here about dt64naive and td64.
Don't those go through DatetimeLikeBlock instead of DatetimeTZBlock? (I didn't change any behavior there).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right. what im saying is that i think NDArrayBackedBlock.values_for_json works, so no longer needs to be overridden
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll take care of this in a follow up.
Sorry for the low quality code. I was prototyping quickly and forgot to clean up afterwards. Regarding #46730, I think I'll stick to the current approach for now. Existing perf benchmarks surprisingly seem unaffected by this change, so not sure its worth pursuing at least for now(it might be worth doing for the tz-aware case later). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one last nit otherwise this lgtm
if (PyObject_HasAttrString(obj, "tzinfo")) { | ||
PyObject *offset = extract_utc_offset(obj); | ||
if (offset == NULL) { | ||
return NULL; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we also need to PyObject_Free(result)
before returning if things fail to avoid any leaks
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lgtm - impressive work
thanks @lithomas1 very nice! I believe there is a de-duplication followup (values_for_json) if you can |
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.