ENH: Add JSON export option for DataFrame (take 2) #1263
Conversation
Cool. Do you have an opinion on whether timestamps should be converted to JavaScript timestamps (milliseconds since the epoch)? Everything is nanoseconds now in pandas. I'm guessing "probably not", but it could be an option in to_json |
Hmm, when decoding to DataFrames it obviously shouldn't be an issue, but it would be nice to have the option of milliseconds and/or seconds for sharing data with JavaScript, assuming it's easy to deduce the unit from the datetime object. Especially as JSON will probably (?!) primarily be used for sending data client side (i.e. to browsers). Note the original ujson encodes datetimes to seconds, and I don't think there is a standard JSON timestamp unit. Sort of a tangent, but it would also be nice to have an efficient way of converting those timestamps back to datetimes when rebuilding the DataFrame (I'm not suggesting it should be part of the JSON decoding). Maybe that exists already and I'm just not aware of it? |
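For the rebuilding concern, a hedged sketch using today's pandas API: `pd.to_datetime` accepts a `unit` argument for epoch input (whether that existed at the time of this thread isn't clear from the source):

```python
import pandas as pd

# Epoch values as they might arrive from a JavaScript client (milliseconds).
millis = pd.Series([1325376000000, 1325462400000])

# pd.to_datetime accepts a unit ('s', 'ms', 'us', 'ns') for epoch input.
dates = pd.to_datetime(millis, unit="ms")
print(dates)
# 0   2012-01-01
# 1   2012-01-02
# dtype: datetime64[ns]
```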
Just throwing in my 2 cents; I'd really like a simple way to export a DataFrame as JSON in Pandas. |
@drBunsen after this PR is merged, it will be dirt simple: |
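The snippet that followed isn't preserved in this thread; a minimal sketch of the intended usage, assuming the API that eventually shipped (`DataFrame.to_json` and `pandas.read_json`):

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.5]})

# One call serializes the frame to a JSON string.
json_str = df.to_json()

# The reverse direction rebuilds a DataFrame from that string.
df2 = pd.read_json(StringIO(json_str))
```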
@Komnomnomnom working on the timestamp handling issues. Doing everything (in particular, working with the pandas data structures) in C is probably not the best long-term solution since
OK, I think I have the kludge mostly sorted. @Komnomnomnom you have CRLF line endings in these files:
Cool, I merged this. There will no doubt be a bunch of follow-up issues as things settle. One question is how to encode datetime.datetime objects:
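The snippet that followed is not preserved here. As an illustration of the ambiguity, a hedged sketch of the two obvious encodings for a `datetime.datetime` (ISO 8601 string vs. epoch integer), using only the standard library:

```python
from datetime import datetime, timezone

dt = datetime(2012, 5, 1, 12, 30, tzinfo=timezone.utc)

# Option A: ISO 8601 string -- human-readable and unambiguous, but not a number.
iso = dt.isoformat()                   # '2012-05-01T12:30:00+00:00'

# Option B: epoch timestamp -- compact, and JavaScript-friendly in milliseconds.
epoch_ms = int(dt.timestamp() * 1000)  # 1335875400000
```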
I have the appropriate code in |
Thanks Wes. Re the CRLF, I initially preserved the line endings and tabs just to be consistent with the original ujson. Should have switched them for pandas though :). |
It will probably be necessary to add special handling functions (in either C or Python) for complex objects like e.g.
There are probably more possibilities, but if it works, option 2 is the better one IMO: it doesn't impact the current code too much and keeps things general. Although if the timestamp issue is the only one, then perhaps the kludge should remain until numpy 1.7 comes along? |
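The numbered options referenced above were in a part of the comment that isn't preserved. As a general illustration of the "special handling in Python" idea, a hedged sketch of the fallback-handler pattern the standard library's json module supports (the handler and its name are hypothetical, not the PR's actual hooks):

```python
import json
from datetime import datetime

import numpy as np

def pandas_default(obj):
    """Hypothetical fallback for objects the encoder doesn't handle natively."""
    if isinstance(obj, datetime):
        return obj.isoformat()
    if isinstance(obj, np.generic):
        return obj.item()  # unbox numpy scalars to plain Python scalars
    raise TypeError(f"{type(obj)!r} is not JSON serializable")

payload = {"when": datetime(2012, 5, 1), "count": np.int64(3)}
print(json.dumps(payload, default=pandas_default))
# {"when": "2012-05-01T00:00:00", "count": 3}
```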
Will have to return to this at some point. I am fairly certain a lot of performance is being left on the table due to all of the "boxing" of array values. The right approach, as always, is to add performance tests to the vbench suite (vb_suite/) so we can monitor and track the performance of to_json. I'm testing on both NumPy 1.6 and 1.7, so as long as the kludge works and the tests pass, that's good enough for me right now. |
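A hedged sketch of the kind of benchmark that could track this, written with `timeit` rather than the vbench API (the exact vb_suite conventions aren't shown in this thread):

```python
import timeit

setup = """
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(10000, 5))
"""

# Time DataFrame.to_json on a modest frame; vbench would wrap an
# equivalent statement/setup pair and track it across commits.
elapsed = min(timeit.repeat("df.to_json()", setup=setup, number=10, repeat=3))
print(f"to_json: {elapsed / 10 * 1000:.2f} ms per call")
```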
@trottier this is in master now (#3876); docs: http://pandas.pydata.org/pandas-docs/dev/io.html#json |
Sorry, I should have been more explicit. I would recommend against using ujson in pandas because (and unfortunately this isn't documented) ujson handles floating point numbers unconventionally. ultrajson/ultrajson#69 (comment)
simplejson is almost as fast, and doesn't have these issues. |
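To illustrate the kind of issue being referenced (a hedged example; the exact behavior depends on the ujson version, and old releases defaulted to roughly 10 digits of double precision):

```python
import json

import ujson  # behavior below assumes an old release's default double_precision

value = 0.1234567890123456789

print(json.dumps(value))   # round-trippable output, e.g. 0.12345678901234568
print(ujson.dumps(value))  # historically truncated, e.g. 0.1234567890

# Newer ujson exposes double_precision to control this explicitly.
print(ujson.dumps(value, double_precision=17))
```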
Second attempt at the JSON pull request (original: #1226) for issue #631.
All tests pass apart from the two below (tested on 64-bit OS X and 32-bit Ubuntu).
Timestamp JSON encoding/decoding in test_frame and test_series fails. I haven't looked into this too much since, as Wes mentioned, the timestamp work is still ongoing. It appears to occur because the code always operates on the underlying numpy array rather than the pandas objects, and the numpy array returns a bad Python date object to be encoded.
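A hedged illustration of the distinction being described (indexing the pandas object versus the raw numpy values; actual behavior on numpy 1.6 may differ from modern numpy):

```python
import pandas as pd

idx = pd.date_range("2012-01-01", periods=2)

# Indexing the pandas object yields a boxed Timestamp...
print(type(idx[0]))

# ...while the underlying numpy array yields a raw datetime64 scalar,
# which is what the encoder sees when it bypasses the pandas layer.
print(type(idx.values[0]))
```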
For numpy 1.6:
For numpy 1.7 the situation is improved and the encoding is correct (although the timestamps are in nanoseconds):
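The original output is not preserved; as a hedged illustration of what "timestamps in nanoseconds" means, note that a pandas Timestamp's integer value is nanoseconds since the epoch:

```python
import pandas as pd

ts = pd.Timestamp("2012-01-01")

# Timestamp.value is nanoseconds since the Unix epoch, so a
# nanosecond-unit JSON encoding of this date emits this integer.
print(ts.value)           # 1325376000000000000
print(ts.value // 10**6)  # 1325376000000 (the millisecond equivalent)
```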