PERF: json support for blocks GH9037 #9130
Lots of code changes! I will give it a test on windows and let you know. Looks good though. Has the serialization order changed at all? Does it matter if it does? IOW for some formats the orderings might be different now.
Thanks @jreback. I have not modified (or added) tests, which should enforce that the serialisation order has not changed. Maybe I'll add a compat test though, just to be sure. FWIW the valgrind run for these changes is clean (but I'll be submitting a PR soon to fix an unrelated segfault in the code for handling Python datetime.time objects).
This passes everything for me on windows, so good to go for me. lmk when you are satisfied with tests. Aside: https://github.com/pydata/pandas/blob/master/pandas/io/tests/test_json/test_ujson.py#L120 fails on windows (on master), but I wonder if it's just a precision issue in the test (it's close to within about 1 part in 100 digits).
cc @cpcloud
+1000 here, nice work. Sorry about the leak!
What are your thoughts on a round-trippable json orient (maybe "roundtrip")? I.e. provide enough metadata to reconstruct a frame or series with 100% fidelity.
Alternatively, how about exposing dumps at the top level, and folks could roll their own.
@cpcloud a roundtrip orient sounds like a good idea; there was some related discussion in #4889, but I'm not keen on adding metadata into all the orients. I do like the notion of a new orient for it, though.

I also like the idea of exposing json at the top level, as you can easily give it more than just frames and series types, i.e. it will happily process numpy arrays, pandas indices and other Python types quite efficiently.
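For illustration, here is a rough sketch (not part of this PR, and the helper names are made up) of how one could roll their own round trip today on top of orient='split' by carrying dtype metadata alongside the payload. It assumes a pandas version where DataFrame.astype accepts a per-column dtype dict, and it glosses over index fidelity and datetime handling.

import json
import pandas as pd

def to_roundtrip_json(df):
    # keep the orient='split' payload plus the per-column dtypes
    return json.dumps({
        "data": json.loads(df.to_json(orient="split")),
        "dtypes": df.dtypes.astype(str).to_dict(),
    })

def from_roundtrip_json(s):
    payload = json.loads(s)
    df = pd.read_json(json.dumps(payload["data"]), orient="split")
    # restore the dtypes that read_json would otherwise have to infer
    return df.astype(payload["dtypes"])

df = pd.DataFrame({"i": [1, 2], "f": [3.0, 4.2]})
restored = from_roundtrip_json(to_roundtrip_json(df))
assert (restored.dtypes == df.dtypes).all()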
@jreback that windows issue is weird, what do you get when you try:

In [6]: import pandas.json as ujson

In [7]: ujson.encode(1e-100)
Out[7]: '1e-100'

In [8]: ujson.decode('1e-100')
Out[8]: 1e-100
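If it really is just floating-point formatting, a tolerance-based comparison along these lines (hypothetical, not the actual test) would pass on both platforms:

import numpy as np
import pandas.json as ujson

value = 1e-100
roundtripped = ujson.decode(ujson.encode(value, double_precision=15))
# compare with a relative tolerance rather than exact equality of the repr
assert np.isclose(roundtripped, value, rtol=1e-10)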
Ok, I've added a compat test for completeness. The only compat issue I can think of is that the json output for mixed frames will be slightly different, because integers were previously promoted to floats, e.g.

v0.15.2

In [2]: pd.DataFrame({'i': [1,2], 'f': [3.0, 4.2]}).to_json()
Out[2]: '{"f":{"0":3.0,"1":4.2},"i":{"0":1.0,"1":2.0}}'

this PR

In [3]: pd.DataFrame({'i': [1,2], 'f': [3.0, 4.2]}).to_json()
Out[3]: '{"f":{"0":3.0,"1":4.2},"i":{"0":1,"1":2}}'

I've added an entry in the release notes about this too.
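For anyone downstream who was relying on the old promoted output, one possible workaround (not something this PR provides, and column order may differ by pandas version) is to upcast before serialising:

In [4]: df = pd.DataFrame({'i': [1, 2], 'f': [3.0, 4.2]})

In [5]: df.to_json()                    # new behaviour: ints stay ints
Out[5]: '{"f":{"0":3.0,"1":4.2},"i":{"0":1,"1":2}}'

In [6]: df.astype('float64').to_json()  # approximates the pre-PR output
Out[6]: '{"f":{"0":3.0,"1":4.2},"i":{"0":1.0,"1":2.0}}'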
@@ -28,6 +28,7 @@ Backwards incompatible API changes
.. _whatsnew_0160.api_breaking:

- ``Index.duplicated`` now returns `np.array(dtype=bool)` rather than `Index(dtype=object)` containing `bool` values. (:issue:`8875`)
- ``DataFrame.to_json`` now returns accurate type serialisation for each column for frames of mixed dtype (:issue:`9037`)
maybe expand on this just a bit (you can in fact show the example of int/float conversions that you gave). You can do a code-block for both prior and current behavior if you want.
Ok, done. Thanks!
@Komnomnomnom minor doc comment. ping when ready and we can merge.
I tested 0.15.2 on win64, and the output shown above is indeed produced with the current PR.
Ok, just tried this myself on win64. I used conda and this article to compile with msvc. I had to make a couple of code fixes to get it to compile, and it worked fine. The code for deserialisation wasn't changed by this PR and hasn't changed since 0.15.2, I think. Might be an issue with your compiler? What are you using, mingw?
If you conda install libpython then you can use mingw out of the box to compile, so probably a compiler issue then.
I'm not sure how conda is set up, but I've had issues before with an extension compiled with mingw when the python core and other extensions were compiled with msvc.
It's ABI compatible, ok, no biggie then.
Ok, works fine with msvc. Thanks @Komnomnomnom
@Komnomnomnom @cpcloud If you are interested in what is mentioned above (the 'roundtrip' orient, or the exposing of json at the top level), it would be good to open separate issues for those.
Thanks @jorisvandenbossche #9146 #9147
This adds block support to the JSON serialiser, as per #9037. I also added code to directly cast and serialise numpy data, which replaces the previous use of intermediate Python objects.
Large performance improvement (~25x) for mixed frames containing datetimes / timedeltas.
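As a rough illustration of the kind of frame this targets (a hypothetical timing harness, not the benchmark behind the ~25x figure; numbers will vary by machine and pandas version):

import timeit
import numpy as np
import pandas as pd

# a mixed-dtype frame with datetime and timedelta columns, the case that
# previously went through intermediate Python objects during serialisation
n = 100000
df = pd.DataFrame({
    "i": np.arange(n),
    "f": np.random.randn(n),
    "dt": pd.date_range("2015-01-01", periods=n, freq="s"),
    "td": pd.to_timedelta(np.arange(n), unit="s"),
})

print(timeit.timeit(df.to_json, number=5) / 5)  # mean seconds per call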
Some questions, any comments appreciated:

- Serialisation uses values instead of blocks if the frame is 'simple'. I'm using the BlockManager methods _is_single_block and is_mixed_dtype to check for a simple frame.
- I'm using the _data attr and mgr_locs to get access to the block data and the block-to-column mapping (see the sketch below). Are there any caveats to this? I know the DataFrame does some caching, but I'm not familiar enough with the details.

Tested locally on Python 2.7 for 32 & 64 bit linux and 3.3 on 64 bit linux. JSON tests run through valgrind. Would appreciate if someone could give it a bash on Windows before merging.
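For context, a small sketch of the internals referred to above (these are private attributes, so names may differ between pandas versions; this only illustrates the block-to-column mapping and is not code from the PR):

import pandas as pd

df = pd.DataFrame({"i": [1, 2], "f": [3.0, 4.2], "s": ["a", "b"]})

# _data is the frame's BlockManager; each block holds a 2-D array of one dtype
for block in df._data.blocks:
    # mgr_locs maps the block's rows back to column positions in the frame
    print(block.dtype, block.mgr_locs.as_array)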
@cpcloud I also fixed a ref leak and added support for date_unit in the #9028 code.