-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
BUG: Timedelta not formatted correctly in to_json #28595
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR
pandas/io/json/_json.py
Outdated
|
||
if is_timedelta64_dtype(obj.dtype) and self.date_format == "iso": | ||
obj = obj.copy() | ||
self.obj = obj.apply(lambda x: x.isoformat()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm not sure about this approach - this doesn't work if the timedelta is contained within an object
array right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, it won't. But the modifications in objToJSON.c should take care of that case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which reminds me, there should probably also be a test for the astype(object)
case. I will add.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to do this at the Python level is ultimately the question though? This seems like it negates the need to make the change to the extension module since these delta objects would subsequently be strings when iso format is requested and not actually delta objects.
FWIW this is going to be rather non-performant, especially for large DataFrames
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, the python-level change doesn't entirely negate the need for the change in the extension module.
If one does df.astype(object).to_json(date_format="iso")
, then we do not execute this code in _json.py and instead goes to the modifications made in objToJSON.c.
So in the case where we pass an object array into the extension module, we can take care of the serialization in C. In the case where we attempt to serialize the DataFrame directly, timedeltas appear differently. Namely, they come into the C code as NumPy timedeltas for which there is no method to serialize as ISO. We could perhaps do a conversion to a Pandas timedelta in the C code and then use the isoformat()
method inside the extension module.
I went with this python-level change based on earlier discussions we had about this issue. In particular: "This would mean dispatching to vectorized methods in Python where applicable (i.e. a DatetimeArray is converted to an array of ISO strings before getting to the serializer)..."
We do end up with a less performant result in the case that we directly serialize the DataFrame but may gain in simplicity. The fix for the object
dtype case should still be good performance-wise.
That said, I would be amenable to attempting a fix contained entirely within the extension module.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm OK. Yea this is a weird in between. Maybe OK for now knowing it fixes the bug and we can worry about performance in a follow up
pandas/tests/io/json/test_pandas.py
Outdated
) | ||
def test_series_timedelta_to_json(self, date_format, delta, formatted): | ||
# GH28156: to_json not correctly formatting Timedelta | ||
s = Series([pd.Timedelta(d) for d in delta]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be able to use pd.timedelta_range
pandas/tests/io/json/test_pandas.py
Outdated
def test_series_timedelta_to_json(self, date_format, delta, formatted): | ||
# GH28156: to_json not correctly formatting Timedelta | ||
s = Series([pd.Timedelta(d) for d in delta]) | ||
s.index = ["one", "two"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is modifying the index critical for the test? If not best to exclude
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wouldn't say it's critical. I had initially thought to include it because in a previous iteration I had some code in there that was resetting the index and therefore causing a test like this to fail. But I can see that if the changes are sufficiently well-constrained to not mess with anything outside of what it should do, then changing the index in the test might not be critical.
result = json.loads(s.to_json(date_format=date_format)) | ||
expected = {k: v for k, v in zip(s.index, formatted)} | ||
|
||
assert result == expected |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm can you make the test compare against the actual expected output? Right now I'm not sure this actually tests the date format, i.e. it could still always write out in epoch and just infer that back on the way in and test would still pass
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
4cacdd1
to
d7879af
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a benchmark that contains time deltas to track performance?
pandas/io/json/_json.py
Outdated
|
||
if is_timedelta64_dtype(obj.dtype) and self.date_format == "iso": | ||
obj = obj.copy() | ||
self.obj = obj.apply(lambda x: x.isoformat()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm OK. Yea this is a weird in between. Maybe OK for now knowing it fixes the bug and we can worry about performance in a follow up
pandas/io/json/_json.py
Outdated
@@ -171,6 +171,34 @@ def _write( | |||
class SeriesWriter(Writer): | |||
_default_orient = "index" | |||
|
|||
def __init__( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you annotate these signatures?
Hello @cbertinato! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found: There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻 Comment last updated at 2019-10-05 16:30:59 UTC |
There was an existing DataFrame that mixed timedeltas, datetimes, and ints, so just to be safe I added another with only timedeltas. I also added a benchmark with
Interestingly, there are some significant changes for int-float DataFrames as well. |
aa9749a
to
72207d4
Compare
I tried instead converting to pandas Timedelta in C which, performance-wise, is less bad than converting in Python, but still not that great:
I wonder whether we can ever do much better than this without a function that formats NumPy timedeltas directly. |
@cbertinato what is status of this? |
Unless I'm missing another way to solve this, it looks like we take a performance hit. So I suppose the question is whether this is tolerable for this fix, or whether we should trash this branch and go with a vectorized solution something like Will's work in #28863. |
instead of this approach i would convert the timedeltas to longs (using vectorized methods on the series) before passing to the c based dumper |
I think this should be easier to fix after #30283. I think will just need a new method for time delta to iso though |
In C? |
Right I think would be easier to do in C since the datetime code is already there |
Sorry for the severe lag on this. Will pick it back up now that #30283 is in. |
@cbertinato I think I'm close on a C impl actually. might be able to push up soon to compare |
Thanks for the PR! I think will be superseded by #30903 though so closing for now |
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff