Skip to content

fix to_json for numbers larger than sys.maxsize #34473

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 86 commits into from
Jun 24, 2020

Conversation

arw2019
Copy link
Member

@arw2019 arw2019 commented May 30, 2020

Currently this patch causes a significant reduction for a number of the JSON performance benchmarks. A printout of my results is included below.

json_benchmarks_results.txt

I'd love to keep working on this if anybody has ideas for making the solution more efficient!

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs to be handled in the ujson impl

@WillAyd
Copy link
Member

WillAyd commented May 30, 2020

If we do it in ujson would be around here:

if (exc && PyErr_ExceptionMatches(PyExc_OverflowError)) {

So instead of raising I think would convert to a buffer or string object and write that out. Would probably need to do something else on top of that to not add quotes in the output

@arw2019
Copy link
Member Author

arw2019 commented May 31, 2020

@WillAyd Thanks for this! I'm some of the way along the solution.

Inside the if statement that starts on line

if (exc && PyErr_ExceptionMatches(PyExc_OverflowError)) {

I convert obj into a string. This then needs be passed along for encoding.

Scrolling down objToJSON.c I understand that in this line

ret = JSON_EncodeObject(oinput, encoder, buffer, sizeof(buffer));

the objToJSON function has created an encoded string using the encoder and buffer objects. I feel like I'm not quite clear on how those work here... If I'm understanding correctly, the encoder is a struct with information about formatting and the buffer is where the output is actually stored. In that case I would want to figure out how to set the parameters of the encoder and how to dump my string into the buffer?

Thanks so much for the help with this!

@WillAyd
Copy link
Member

WillAyd commented May 31, 2020

You typically don't need to touch the buffer. At a high level the way it works is you set the type of object being serialized and there is a context associated with that type which the serializer will use to determine whether to continue recursion further (i.e. to keep nesting say a dictionary) or to simply write the object.

You can take any example, but if you look at the float code:

} else if (PyFloat_Check(obj)) {

We are introspecting the object, setting the appropriate doubleValue member (for the non null case) and telling the serializer that the type of the object is JT_DOUBLE. If you then look here you'll see ujson writing that value out to the buffer:

So for this, would need to figure out a way to write out a numeric value that exceeds C's storage limits for integers.

One possible way is to store the Python str representation of the value into the cStr member of the TypeContext struct much like we do for JT_UTF8 types, but instead of using UTF8 as the type (which will add quoting) create a new enum entry here that allows that wouldn't quote the output and add the appropriate code to the serializer for that type

@WillAyd
Copy link
Member

WillAyd commented Jun 3, 2020

Can you add a test for this? Will help guide the implementation and feedback

@arw2019
Copy link
Member Author

arw2019 commented Jun 3, 2020

Can you add a test for this? Will help guide the implementation and feedback

I added one in pandas/tests/io/json/test_ujson.py. I'm testing encoding and decoding against the in-built json library:

big_num = sys.maxsize + 1
encoding = ujson.dumps(big_num)

assert encoding == json.dumps(big_num)
assert ujson.loads(encoding) == big_num

@arw2019
Copy link
Member Author

arw2019 commented Jun 20, 2020

So the CI build is failing, with this message:

1.07s$ ci/run_tests.sh
xvfb-run pytest -m "(not slow and not network and not clipboard)" -n auto --dist=loadfile -s --strict --durations=30 --junitxml=test-data.xml pandas
ImportError while loading conftest '/home/travis/build/pandas-dev/pandas/pandas/conftest.py'.
pandas/__init__.py:34: in <module>
    raise ImportError(
E   ImportError: C extension: 'dtypes' from partially initialized module 'pandas._libs.tslibs' (most likely due to a circular import) (/home/travis/build/pandas-dev/pandas/pandas/_libs/tslibs/__init__.py) not built. If you want to import pandas from the source directory, you may need to run 'python setup.py build_ext --inplace --force' to build the C extensions first.
The command "ci/run_tests.sh" exited with 4.

I feel like the problem could be these lines in pandas/tests/io/json/test_pandas.py:

        originalSeries = Series(bigNum, dtype=object, index=["articleId"])
        originalDataFrame = DataFrame(
            bigNum, dtype=object, index=["articleId"], columns=[0]
        )

since Travis is complaining about the dtypes extension...?

@WillAyd
Copy link
Member

WillAyd commented Jun 20, 2020 via email

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you run the json perf tests to see if anything is affected. you may need to add some long ints there as well.

@arw2019
Copy link
Member Author

arw2019 commented Jun 22, 2020

@jreback Benchmark results are attached.
json_performance_benchmarks.txt

The ones which changed significantly are:

       before           after         ratio
     [7d0ee96f]       [9e1b95f7]
     <master>         <json-Overflow-long-int>
+         268±4ms         315±40ms     1.17  io.json.ToJSON.time_to_json('index', 'df_int_floats')
+         242±3ms          278±3ms     1.15  io.json.ToJSON.time_to_json('index', 'df_td_int_ts')
+        364±20ms         407±20ms     1.12  io.json.ToJSON.time_to_json_wide('split', 'df_int_floats')
-        706±30ms         606±10ms     0.86  io.json.ReadJSONLines.time_read_json_lines('int')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.

I'm working on adding long int benchmarks.

@@ -82,6 +84,7 @@ def setup(self, orient, frame):
timedeltas = timedelta_range(start=1, periods=N, freq="s")
datetimes = date_range(start=1, periods=N, freq="s")
ints = np.random.randint(100000000, size=N)
longints = sys.maxsize * np.random.randint(100000000, size=N)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback Not sure where we want the longints to go - whether into ints or separately. I would think we also want to make sure we have both positive and negative long ints in there?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this prob fine

@@ -82,6 +84,7 @@ def setup(self, orient, frame):
timedeltas = timedelta_range(start=1, periods=N, freq="s")
datetimes = date_range(start=1, periods=N, freq="s")
ints = np.random.randint(100000000, size=N)
longints = sys.maxsize * np.random.randint(100000000, size=N)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this prob fine

@jreback jreback added this to the 1.1 milestone Jun 24, 2020
@jreback jreback added the Bug label Jun 24, 2020
@jreback
Copy link
Contributor

jreback commented Jun 24, 2020

looks ok to me. @WillAyd when you are happy.

@WillAyd WillAyd merged commit d85b93d into pandas-dev:master Jun 24, 2020
@WillAyd
Copy link
Member

WillAyd commented Jun 24, 2020

Thanks @arw2019 very nice PR

@arw2019
Copy link
Member Author

arw2019 commented Jun 25, 2020

Had a lot of fun doing this. @WillAyd @jreback thanks so much for all the help!

#34984 is a follow-up PR to deal with the correspondent problem in json.decode

@arw2019 arw2019 deleted the json-Overflow-long-int branch June 26, 2020 17:51
fangchenli pushed a commit to fangchenli/pandas that referenced this pull request Jun 27, 2020
* BUG: overflow on to_json with numbers larger than sys.maxsize

* TST: overflow on to_json with numbers larger than sys.maxsize (pandas-dev#34395)

* DOC: update with issue pandas-dev#34395

* TST: removed unused import

* ENH: added case JT_BIGNUM to encode

* ENH: added JT_BIGNUM to JSTYPES

* BUG: changed error for ints>sys.maxsize into JT_BIGNUM

* ENH: removed debug statements

* BUG: removed dumps wrapper

* removed bigNum from TypeContext

* TST: fixed bug in the test

* added pointer to string rep converter for BigNum

* TST: removed ujson.loads from the test

* added getBigNumStringValue

* added code to JT_BIGNUM handler by analogy with JT_UTF8

* TST: update pandas/tests/io/json/test_ujson.py

Co-authored-by: William Ayd <[email protected]>

* added Object_getBigNumStringValue to pyEncoder

* added skeletal code for Object_GetBigNumStringValue

* completed Object_getBigNumStringValue using PyObject_Repr

* BUG: changed Object_getBigNumStringValue

* improved Object_getBigNumStringValue some more

* update getBigNumStringValue argument

* corrected Object_getBigNumStringValue

* more fixes to Object_getBigNumStringValue

* Update pandas/_libs/src/ujson/python/objToJSON.c

Co-authored-by: William Ayd <[email protected]>

* Update pandas/_libs/src/ujson/python/objToJSON.c

Co-authored-by: William Ayd <[email protected]>

* Update pandas/_libs/src/ujson/python/objToJSON.c

Co-authored-by: William Ayd <[email protected]>

* Update pandas/_libs/src/ujson/python/objToJSON.c

Co-authored-by: William Ayd <[email protected]>

* Update pandas/_libs/src/ujson/python/objToJSON.c

Co-authored-by: William Ayd <[email protected]>

* Update pandas/_libs/src/ujson/python/objToJSON.c

Co-authored-by: William Ayd <[email protected]>

* Update pandas/_libs/src/ujson/python/objToJSON.c

* Update pandas/_libs/src/ujson/python/objToJSON.c

* updated pyEncoder for JT_BIGNUM

* updated pyEncoder

* moved getBigNumStringValue to pyEncoder

* fixed declaration of Object_getBigNumStringValue

* fixed Object_getBigNumStringValue

* catch overflow error with PyLong_AsLongLongAndOverflow

* remove unnecessary error check

* added shortcircuit for error check

* simplify int overflow error catching

Co-authored-by: William Ayd <[email protected]>

* Update long int test in pandas/tests/io/json/test_ujson.py

Co-authored-by: William Ayd <[email protected]>

* removed tests expecting numeric overflow

* remove underscore from overflow

Co-authored-by: William Ayd <[email protected]>

* removed underscores from _overflow everywhere

* fixed small typo

* fix type of exc

* deleted numeric overflow tests

* remove extraneous condition in if statement

Co-authored-by: William Ayd <[email protected]>

* remove extraneous condition in if statement

Co-authored-by: William Ayd <[email protected]>

* change _Bool into int

Co-authored-by: William Ayd <[email protected]>

* Update pandas/_libs/src/ujson/python/objToJSON.c

Co-authored-by: William Ayd <[email protected]>

* Update pandas/_libs/src/ujson/lib/ultrajsonenc.c

Co-authored-by: William Ayd <[email protected]>

* allocate an extra byte in Object_getBigNumStringValue

Co-authored-by: William Ayd <[email protected]>

* allocate an extra byte in Object_getBigNumStringValue

Co-authored-by: William Ayd <[email protected]>

* reinstate RESERVE_STRING(szlen) in JT_BIGNUM case

* replaced (private) with (public) in whatnew

* release bytes in Object_endTypeContext

* in JT_BIGNUM change if+if into if+else if

* added reallocation of bigNum_bytes

* removed bigNum_bytes

* added to_json test for ints>sys.maxsize

* Use python malloc to match PyObject_Free in endTypeContext

Co-authored-by: William Ayd <[email protected]>

* TST: added manually constructed strs to compare encodings

* fixed styling to minimize diff with master

* fixed styling

* fixed conflicts with master

* fix styling to minimize diff

* fix styling to minimize diff

* fixed styling

* added negative nigNum to test_to_json_large_numers

* added negative nigNum to test_to_json_large_numers

* Update pandas/tests/io/json/test_ujson.py

Co-authored-by: William Ayd <[email protected]>

* fixe test_to_json_for_large_nums for -ve

* TST: added xfail for ujson.encode with long int input

* TST: fixed variable names in test_to_json_large_numbers

* TST: added xfail test for json.decode Series with long int

* TST: added xfail test for json.decode DataFrame with long int

* BENCH: added benchmarks for long ints

Co-authored-by: William Ayd <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug IO JSON read_json, to_json, json_normalize
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: OverflowError on to_json with numbers larger than sys.maxsize
3 participants