PERF: json support for blocks GH9037 #9130
Lots of code changes! I will give it a test on windows and let you know. Looks good though. Has the serialization order changed at all? Does it matter if it does? IOW for some formats the orderings might be different now.
Thanks @jreback. I have not modified (or added) tests, which should enforce that the serialisation order has not changed. Maybe I'll add a compat test though, just to be sure. FWIW the valgrind run for these changes is clean (but I'll be submitting a PR soon to fix an unrelated segfault in the code for handling Python datetime.time objects).
This passes everything for me on windows, so good to go for me. lmk when you are satisfied with tests. Aside: https://github.com/pydata/pandas/blob/master/pandas/io/tests/test_json/test_ujson.py#L120 fails on windows (on master), but I wonder if it's just a precision issue in the test (it's close to within about 1 part in 100 digits).
cc @cpcloud
+1000 here, nice work. Sorry about the leak!
What are your thoughts on a round-trippable json orient (maybe "roundtrip")? I.e. provide enough metadata to reconstruct a frame or series with 100% fidelity.
Alternatively, how about exposing dumps at the top level, and folks could roll their own.
@cpcloud a roundtrip orient sounds like a good idea; there was some related discussion in #4889, but I'm not keen on adding metadata into all the orients. I do like the notion of a new orient for it, though.

I also like the idea of exposing json at the top level, as you can easily give it more than just frames and series types, i.e. it will happily process numpy arrays, pandas indices and other Python types quite efficiently.
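For illustration, here is a rough sketch (not part of this PR, and the helper names are made up) of how one could roll their own round trip today on top of orient='split' by carrying dtype metadata alongside the payload. It assumes a pandas version where DataFrame.astype accepts a per-column dtype dict, and it glosses over index fidelity and datetime handling.

import json
import pandas as pd

def to_roundtrip_json(df):
    # keep the orient='split' payload plus the per-column dtypes
    return json.dumps({
        "data": json.loads(df.to_json(orient="split")),
        "dtypes": df.dtypes.astype(str).to_dict(),
    })

def from_roundtrip_json(s):
    payload = json.loads(s)
    df = pd.read_json(json.dumps(payload["data"]), orient="split")
    # restore the dtypes that read_json would otherwise have to infer
    return df.astype(payload["dtypes"])

df = pd.DataFrame({"i": [1, 2], "f": [3.0, 4.2]})
restored = from_roundtrip_json(to_roundtrip_json(df))
assert (restored.dtypes == df.dtypes).all()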
@jreback that windows issue is weird, what do you get when you try:

In [6]: import pandas.json as ujson

In [7]: ujson.encode(1e-100)
Out[7]: '1e-100'

In [8]: ujson.decode('1e-100')
Out[8]: 1e-100
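If it really is just floating-point formatting, a tolerance-based comparison along these lines (hypothetical, not the actual test) would pass on both platforms:

import numpy as np
import pandas.json as ujson

value = 1e-100
roundtripped = ujson.decode(ujson.encode(value, double_precision=15))
# compare with a relative tolerance rather than exact equality of the repr
assert np.isclose(roundtripped, value, rtol=1e-10)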
Ok, I've added a compat test for completeness. The only compat issue I can think of is that the json output for mixed frames will be slightly different, because integers were previously promoted to floats, e.g.

v0.15.2

In [2]: pd.DataFrame({'i': [1,2], 'f': [3.0, 4.2]}).to_json()
Out[2]: '{"f":{"0":3.0,"1":4.2},"i":{"0":1.0,"1":2.0}}'

this PR

In [3]: pd.DataFrame({'i': [1,2], 'f': [3.0, 4.2]}).to_json()
Out[3]: '{"f":{"0":3.0,"1":4.2},"i":{"0":1,"1":2}}'

I've added an entry in the release notes about this too.
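For anyone downstream who was relying on the old promoted output, one possible workaround (not something this PR provides, and column order may differ by pandas version) is to upcast before serialising:

In [4]: df = pd.DataFrame({'i': [1, 2], 'f': [3.0, 4.2]})

In [5]: df.to_json()                    # new behaviour: ints stay ints
Out[5]: '{"f":{"0":3.0,"1":4.2},"i":{"0":1,"1":2}}'

In [6]: df.astype('float64').to_json()  # approximates the pre-PR output
Out[6]: '{"f":{"0":3.0,"1":4.2},"i":{"0":1.0,"1":2.0}}'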
@@ -28,6 +28,7 @@ Backwards incompatible API changes
.. _whatsnew_0160.api_breaking:

- ``Index.duplicated`` now returns `np.array(dtype=bool)` rather than `Index(dtype=object)` containing `bool` values. (:issue:`8875`)
- ``DataFrame.to_json`` now returns accurate type serialisation for each column for frames of mixed dtype (:issue:`9037`)
maybe expand on this just a bit (you can in fact show the example of int/float conversions that you gave). You can do a code-block for both prior and current behavior if you want.
Ok, done. Thanks!
@Komnomnomnom minor doc comment. ping when ready and we can merge.
I tested 0.15.2 on win64, and the output shown above is indeed produced with the current PR.
Ok, just tried this myself on win64. I used conda and this article to compile with msvc. I had to make a couple of code fixes to get it to compile, and it worked fine. The code for deserialisation wasn't changed by this PR and hasn't changed since 0.15.2, I think. Might be an issue with your compiler? What are you using, mingw?
If you conda install libpython then you can use mingw out of the box to compile, so probably a compiler issue then.
I'm not sure how conda is set up, but I've had issues before with an extension compiled with mingw when the python core and other extensions were compiled with msvc.
It's ABI compatible, ok, no biggie then.
Ok, works fine with msvc. Thanks @Komnomnomnom
@Komnomnomnom @cpcloud If you are interested in what is mentioned above (the 'roundtrip' orient, or the exposing of json at the top level), it would be good to open separate issues for those.
Thanks @jorisvandenbossche #9146 #9147
This adds block support to the JSON serialiser, as per #9037. I also added code to directly cast and serialise numpy data, which replaces the previous use of intermediate Python objects.
Large performance improvement (~25x) for mixed frames containing datetimes / timedeltas.
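As a rough illustration of the kind of frame this targets (a hypothetical timing harness, not the benchmark behind the ~25x figure; numbers will vary by machine and pandas version):

import timeit
import numpy as np
import pandas as pd

# a mixed-dtype frame with datetime and timedelta columns, the case that
# previously went through intermediate Python objects during serialisation
n = 100000
df = pd.DataFrame({
    "i": np.arange(n),
    "f": np.random.randn(n),
    "dt": pd.date_range("2015-01-01", periods=n, freq="s"),
    "td": pd.to_timedelta(np.arange(n), unit="s"),
})

print(timeit.timeit(df.to_json, number=5) / 5)  # mean seconds per call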
Some questions, any comments appreciated:

- Serialisation uses values instead of blocks if the frame is 'simple'. I'm using the BlockManager methods _is_single_block and is_mixed_dtype to check for a simple frame.
- I'm using the _data attr and mgr_locs to get access to the block data and the block-to-column mapping (see the sketch below). Are there any caveats to this? I know the DataFrame does some caching, but I'm not familiar enough with the details.

Tested locally on Python 2.7 for 32 & 64 bit linux and 3.3 on 64 bit linux. JSON tests run through valgrind. Would appreciate if someone could give it a bash on Windows before merging.
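For context, a small sketch of the internals referred to above (these are private attributes, so names may differ between pandas versions; this only illustrates the block-to-column mapping and is not code from the PR):

import pandas as pd

df = pd.DataFrame({"i": [1, 2], "f": [3.0, 4.2], "s": ["a", "b"]})

# _data is the frame's BlockManager; each block holds a 2-D array of one dtype
for block in df._data.blocks:
    # mgr_locs maps the block's rows back to column positions in the frame
    print(block.dtype, block.mgr_locs.as_array)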
@cpcloud I also fixed a ref leak and added support for date_unit in the #9028 code.