fix to_json for numbers larger than sys.maxsize #34473

arw2019 · 2020-05-30T00:40:35Z

closes BUG: OverflowError on to_json with numbers larger than sys.maxsize #34395
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
passes performance benchmarks
whatsnew entry

Currently this patch causes a significant slowdown in a number of the JSON performance benchmarks. A printout of my results is included below.

json_benchmarks_results.txt

I'd love to keep working on this if anybody has ideas for making the solution more efficient!

jreback

this needs to be handled in the ujson impl

WillAyd · 2020-05-30T01:14:29Z

If we do it in ujson would be around here:

pandas/pandas/_libs/src/ujson/python/objToJSON.c

Line 1636 in 1cad9e5

if (exc && PyErr_ExceptionMatches(PyExc_OverflowError)) {

So instead of raising, I think we would convert the value to a buffer or string object and write that out. We would probably need to do something else on top of that so that the output is not wrapped in quotes.
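As a rough illustration of that idea (a Python model, not the C code in objToJSON.c): try to fit the integer into a signed 64-bit slot and, when that overflows, keep its decimal digits as text for the writer to emit unquoted. The names `fits_in_int64` and `classify_int` are made up for this sketch.

```python
def fits_in_int64(value: int) -> bool:
    """Return True if value fits in a signed 64-bit integer."""
    try:
        # int.to_bytes raises OverflowError when the value needs more than
        # 8 bytes, loosely mirroring the overflow branch in the C encoder.
        value.to_bytes(8, byteorder="little", signed=True)
    except OverflowError:
        return False
    return True


def classify_int(value: int):
    """Hypothetical helper: decide how an integer would be written out."""
    if fits_in_int64(value):
        return ("long", value)
    # Too big for a 64-bit slot: carry the digits as text instead of raising.
    return ("bignum", str(value))
```

The second branch is the interesting one: the value is no longer a C integer at all, so whatever writes it out has to copy the digits through as-is.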

arw2019 · 2020-05-31T02:24:54Z

@WillAyd Thanks for this! I'm part of the way to a solution.

Inside the if statement that starts on line

pandas/pandas/_libs/src/ujson/python/objToJSON.c

Line 1636 in 1cad9e5

if (exc && PyErr_ExceptionMatches(PyExc_OverflowError)) {

I convert obj into a string. This then needs to be passed along for encoding.

Scrolling down objToJSON.c I understand that in this line

pandas/pandas/_libs/src/ujson/python/objToJSON.c

Line 2285 in 1cad9e5

ret = JSON_EncodeObject(oinput, encoder, buffer, sizeof(buffer));

the objToJSON function has created an encoded string using the encoder and buffer objects. I feel like I'm not quite clear on how those work here... If I'm understanding correctly, the encoder is a struct with information about formatting and the buffer is where the output is actually stored. In that case I would want to figure out how to set the parameters of the encoder and how to dump my string into the buffer?

Thanks so much for the help with this!

WillAyd · 2020-05-31T17:17:57Z

You typically don't need to touch the buffer. At a high level the way it works is you set the type of object being serialized and there is a context associated with that type which the serializer will use to determine whether to continue recursion further (i.e. to keep nesting say a dictionary) or to simply write the object.

You can take any example, but if you look at the float code:

pandas/pandas/_libs/src/ujson/python/objToJSON.c

Line 1642 in d5e8edc

} else if (PyFloat_Check(obj)) {

We are introspecting the object, setting the appropriate doubleValue member (for the non null case) and telling the serializer that the type of the object is JT_DOUBLE. If you then look here you'll see ujson writing that value out to the buffer:

pandas/pandas/_libs/src/ujson/lib/ultrajsonenc.c

Line 1072 in d5e8edc

case JT_DOUBLE: {
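The flow described above can be modelled in a few lines of Python. This is only a sketch of the idea, not the ujson implementation: `JsType`, `TypeContext`, `begin_type_context`, and `encode` are illustrative names, and only the float and string cases are shown.

```python
import json
from dataclasses import dataclass
from enum import Enum, auto


class JsType(Enum):
    # Illustrative stand-ins for entries of the C enum JSTYPES.
    DOUBLE = auto()
    UTF8 = auto()


@dataclass
class TypeContext:
    type: JsType
    value: object


def begin_type_context(obj) -> TypeContext:
    """Introspect obj and record which kind of value the writer should emit."""
    if isinstance(obj, float):
        return TypeContext(JsType.DOUBLE, obj)
    if isinstance(obj, str):
        return TypeContext(JsType.UTF8, obj)
    raise TypeError(f"unsupported type in this sketch: {type(obj).__name__}")


def encode(obj) -> str:
    """Dispatch on the recorded type, as the serializer's switch does."""
    ctx = begin_type_context(obj)
    if ctx.type is JsType.DOUBLE:
        return repr(ctx.value)
    # UTF8: written as a quoted, escaped JSON string.
    return json.dumps(ctx.value)
```

The caller never touches an output buffer here either: it only tags the value with a type, and the writer decides how that type is rendered.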

So for this, we would need to figure out a way to write out a numeric value that exceeds C's storage limits for integers.

One possible way is to store the Python str representation of the value in the cStr member of the TypeContext struct, much like we do for JT_UTF8 types. But instead of using UTF8 as the type (which will add quoting), create a new enum entry here that doesn't quote the output, and add the appropriate code to the serializer for that type.

pandas/pandas/_libs/src/ujson/lib/ultrajson.h

Line 146 in d5e8edc

enum JSTYPES {
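A minimal Python sketch of that proposal, assuming the new entry is called `BIGNUM` (the names in this block are illustrative, not the C definitions): both cases carry a string, but only the UTF8 case wraps it in quotes.

```python
import json
from enum import Enum, auto


class JsType(Enum):
    UTF8 = auto()
    BIGNUM = auto()  # assumed name for the new, unquoted entry


def write_value(js_type: JsType, c_str: str) -> str:
    """Emit c_str either as a quoted JSON string or as bare digits."""
    if js_type is JsType.UTF8:
        return json.dumps(c_str)  # adds quotes and escapes
    if js_type is JsType.BIGNUM:
        return c_str  # digits copied through verbatim, no quotes
    raise ValueError(f"unhandled type: {js_type}")


as_text = str(2**64)
print(write_value(JsType.UTF8, as_text))    # quoted: a JSON string
print(write_value(JsType.BIGNUM, as_text))  # unquoted: a JSON number
```

Because the unquoted form is still valid JSON, any decoder with arbitrary-precision integers can read it back as a number.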

WillAyd · 2020-06-03T16:41:05Z

Can you add a test for this? Will help guide the implementation and feedback

arw2019 · 2020-06-03T22:05:55Z

> Can you add a test for this? Will help guide the implementation and feedback

I added one in pandas/tests/io/json/test_ujson.py. I'm testing encoding and decoding against the in-built json library:

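import json
import sys

import pandas._libs.json as ujson  # pandas' vendored ujson, as imported in the test suite

big_num = sys.maxsize + 1
encoding = ujson.dumps(big_num)

assert encoding == json.dumps(big_num)
assert ujson.loads(encoding) == big_num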
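For reference, the behaviour the test compares against can be checked with the standard library alone. The sketch below uses only `json` and `sys`; it shows what the built-in encoder produces for such a value, not what pandas' ujson does.

```python
import json
import sys

big_num = sys.maxsize + 1

# The built-in encoder writes arbitrary-precision ints as bare digits.
reference = json.dumps(big_num)
assert reference == str(big_num)
assert not reference.startswith('"')

# The built-in decoder reads them back without loss.
assert json.loads(reference) == big_num
```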

arw2019 · 2020-06-20T02:11:28Z

So the CI build is failing, with this message:

1.07s$ ci/run_tests.sh
xvfb-run pytest -m "(not slow and not network and not clipboard)" -n auto --dist=loadfile -s --strict --durations=30 --junitxml=test-data.xml pandas
ImportError while loading conftest '/home/travis/build/pandas-dev/pandas/pandas/conftest.py'.
pandas/__init__.py:34: in <module>
    raise ImportError(
E   ImportError: C extension: 'dtypes' from partially initialized module 'pandas._libs.tslibs' (most likely due to a circular import) (/home/travis/build/pandas-dev/pandas/pandas/_libs/tslibs/__init__.py) not built. If you want to import pandas from the source directory, you may need to run 'python setup.py build_ext --inplace --force' to build the C extensions first.
The command "ci/run_tests.sh" exited with 4.

I feel like the problem could be these lines in pandas/tests/io/json/test_pandas.py:

        originalSeries = Series(bigNum, dtype=object, index=["articleId"])

        originalDataFrame = DataFrame(
            bigNum, dtype=object, index=["articleId"], columns=[0]
        )

since Travis is complaining about the dtypes extension...?

WillAyd · 2020-06-20T15:39:08Z

Can you merge master and repush? I don’t think that error is related to changes made here

jreback

can you run the json perf tests to see if anything is affected. you may need to add some long ints there as well.

arw2019 · 2020-06-22T08:13:51Z

@jreback Benchmark results are attached.
json_performance_benchmarks.txt

The ones which changed significantly are:

       before           after         ratio
     [7d0ee96f]       [9e1b95f7]
     <master>         <json-Overflow-long-int>
+         268±4ms         315±40ms     1.17  io.json.ToJSON.time_to_json('index', 'df_int_floats')
+         242±3ms          278±3ms     1.15  io.json.ToJSON.time_to_json('index', 'df_td_int_ts')
+        364±20ms         407±20ms     1.12  io.json.ToJSON.time_to_json_wide('split', 'df_int_floats')
-        706±30ms         606±10ms     0.86  io.json.ReadJSONLines.time_read_json_lines('int')

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
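The ratio column is the "after" time divided by the "before" time, so values above 1 are slowdowns. A small sketch that recomputes it from the rounded timings in the table above; the 1.1 threshold is an assumption about the comparison's default, and the recomputed ratios can differ slightly from the printed ones because the timings shown are rounded.

```python
# (before_ms, after_ms) copied from the rounded values in the table above.
timings = {
    "time_to_json('index', 'df_int_floats')": (268, 315),
    "time_to_json('index', 'df_td_int_ts')": (242, 278),
    "time_to_json_wide('split', 'df_int_floats')": (364, 407),
    "time_read_json_lines('int')": (706, 606),
}

FACTOR = 1.1  # assumed significance threshold


def classify(before: float, after: float, factor: float = FACTOR) -> str:
    """Label a benchmark as slower, faster, or unchanged by its time ratio."""
    ratio = after / before
    if ratio > factor:
        return "slower"
    if ratio < 1 / factor:
        return "faster"
    return "unchanged"


for name, (before, after) in timings.items():
    print(f"{after / before:.2f}  {classify(before, after):9s}  {name}")
```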

I'm working on adding long int benchmarks.

arw2019 · 2020-06-22T20:48:54Z

asv_bench/benchmarks/io/json.py

@@ -82,6 +84,7 @@ def setup(self, orient, frame):
        timedeltas = timedelta_range(start=1, periods=N, freq="s")
        datetimes = date_range(start=1, periods=N, freq="s")
        ints = np.random.randint(100000000, size=N)
+        longints = sys.maxsize * np.random.randint(100000000, size=N)


@jreback Not sure where we want the longints to go - whether into ints or separately. I would think we also want to make sure we have both positive and negative long ints in there?
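One way to get a mix of positive and negative values that all fall outside the signed 64-bit range is to build them as plain Python ints. This is a sketch using only the standard library, not the benchmark code in the diff, and `make_long_ints` is a hypothetical helper. Plain ints are used because arithmetic inside a fixed-width integer array may wrap around instead of growing.

```python
import random
import sys


def make_long_ints(n: int, seed: int = 0) -> list:
    """Return n Python ints outside the signed 64-bit range, with mixed signs."""
    rng = random.Random(seed)
    values = []
    for i in range(n):
        # Start the offset at 2 so that the negated value is also out of
        # range (-(sys.maxsize + 1) would still fit on a 64-bit build).
        magnitude = sys.maxsize + rng.randint(2, 100_000_000)
        # Alternate signs so both directions of overflow are exercised.
        values.append(magnitude if i % 2 == 0 else -magnitude)
    return values


longints = make_long_ints(6)
```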

jreback · 2020-06-24T23:38:53Z

this prob fine

jreback · 2020-06-24T23:41:26Z

looks ok to me. @WillAyd when you are happy.

WillAyd · 2020-06-24T23:45:05Z

Thanks @arw2019 very nice PR

arw2019 · 2020-06-25T07:48:55Z

Had a lot of fun doing this. @WillAyd @jreback thanks so much for all the help!

#34984 is a follow-up PR to deal with the corresponding problem in json.decode

* BUG: overflow on to_json with numbers larger than sys.maxsize
* TST: overflow on to_json with numbers larger than sys.maxsize (pandas-dev#34395)
* DOC: update with issue pandas-dev#34395
* TST: removed unused import
* ENH: added case JT_BIGNUM to encode
* ENH: added JT_BIGNUM to JSTYPES
* BUG: changed error for ints>sys.maxsize into JT_BIGNUM
* ENH: removed debug statements
* BUG: removed dumps wrapper
* removed bigNum from TypeContext
* TST: fixed bug in the test
* added pointer to string rep converter for BigNum
* TST: removed ujson.loads from the test
* added getBigNumStringValue
* added code to JT_BIGNUM handler by analogy with JT_UTF8
* TST: update pandas/tests/io/json/test_ujson.py Co-authored-by: William Ayd
* added Object_getBigNumStringValue to pyEncoder
* added skeletal code for Object_GetBigNumStringValue
* completed Object_getBigNumStringValue using PyObject_Repr
* BUG: changed Object_getBigNumStringValue
* improved Object_getBigNumStringValue some more
* update getBigNumStringValue argument
* corrected Object_getBigNumStringValue
* more fixes to Object_getBigNumStringValue
* Update pandas/_libs/src/ujson/python/objToJSON.c Co-authored-by: William Ayd
* Update pandas/_libs/src/ujson/python/objToJSON.c Co-authored-by: William Ayd
* Update pandas/_libs/src/ujson/python/objToJSON.c Co-authored-by: William Ayd
* Update pandas/_libs/src/ujson/python/objToJSON.c Co-authored-by: William Ayd
* Update pandas/_libs/src/ujson/python/objToJSON.c Co-authored-by: William Ayd
* Update pandas/_libs/src/ujson/python/objToJSON.c Co-authored-by: William Ayd
* Update pandas/_libs/src/ujson/python/objToJSON.c
* Update pandas/_libs/src/ujson/python/objToJSON.c
* updated pyEncoder for JT_BIGNUM
* updated pyEncoder
* moved getBigNumStringValue to pyEncoder
* fixed declaration of Object_getBigNumStringValue
* fixed Object_getBigNumStringValue
* catch overflow error with PyLong_AsLongLongAndOverflow
* remove unnecessary error check
* added shortcircuit for error check
* simplify int overflow error catching Co-authored-by: William Ayd
* Update long int test in pandas/tests/io/json/test_ujson.py Co-authored-by: William Ayd
* removed tests expecting numeric overflow
* remove underscore from overflow Co-authored-by: William Ayd
* removed underscores from _overflow everywhere
* fixed small typo
* fix type of exc
* deleted numeric overflow tests
* remove extraneous condition in if statement Co-authored-by: William Ayd
* remove extraneous condition in if statement Co-authored-by: William Ayd
* change _Bool into int Co-authored-by: William Ayd
* Update pandas/_libs/src/ujson/python/objToJSON.c Co-authored-by: William Ayd
* Update pandas/_libs/src/ujson/lib/ultrajsonenc.c Co-authored-by: William Ayd
* allocate an extra byte in Object_getBigNumStringValue Co-authored-by: William Ayd
* allocate an extra byte in Object_getBigNumStringValue Co-authored-by: William Ayd
* reinstate RESERVE_STRING(szlen) in JT_BIGNUM case
* replaced (private) with (public) in whatnew
* release bytes in Object_endTypeContext
* in JT_BIGNUM change if+if into if+else if
* added reallocation of bigNum_bytes
* removed bigNum_bytes
* added to_json test for ints>sys.maxsize
* Use python malloc to match PyObject_Free in endTypeContext Co-authored-by: William Ayd
* TST: added manually constructed strs to compare encodings
* fixed styling to minimize diff with master
* fixed styling
* fixed conflicts with master
* fix styling to minimize diff
* fix styling to minimize diff
* fixed styling
* added negative nigNum to test_to_json_large_numers
* added negative nigNum to test_to_json_large_numers
* Update pandas/tests/io/json/test_ujson.py Co-authored-by: William Ayd
* fixe test_to_json_for_large_nums for -ve
* TST: added xfail for ujson.encode with long int input
* TST: fixed variable names in test_to_json_large_numbers
* TST: added xfail test for json.decode Series with long int
* TST: added xfail test for json.decode DataFrame with long int
* BENCH: added benchmarks for long ints Co-authored-by: William Ayd

arw2019 added 3 commits May 29, 2020 23:47

BUG: overflow on to_json with numbers larger than sys.maxsize

95c20db

TST: overflow on to_json with numbers larger than sys.maxsize (pandas-dev#34395)

6d2f8bd

DOC: update with issue pandas-dev#34395

4fc5b87

jreback requested changes May 30, 2020

arw2019 added 6 commits June 3, 2020 02:30

TST: removed unused import

abfca37

ENH: added case JT_BIGNUM to encode

7e63941

ENH: added JT_BIGNUM to JSTYPES

3353420

BUG: changed error for ints>sys.maxsize into JT_BIGNUM

c9574b8

ENH: removed debug statements

94c112f

BUG: removed dumps wrapper

76576b8

WillAyd reviewed Jun 3, 2020

pandas/_libs/src/ujson/lib/ultrajsonenc.c

WillAyd reviewed Jun 3, 2020

pandas/_libs/src/ujson/python/objToJSON.c Outdated

removed bigNum from TypeContext

9f211a5

arw2019 added 2 commits June 3, 2020 22:07

TST: fixed bug in the test

2b7a271

added pointer to string rep converter for BigNum

5e06109

WillAyd reviewed Jun 3, 2020

pandas/tests/io/json/test_ujson.py Outdated

arw2019 added 3 commits June 3, 2020 23:04

TST: removed ujson.loads from the test

755ef47

added getBigNumStringValue

0e768f8

added code to JT_BIGNUM handler by analogy with JT_UTF8

12d73b0

WillAyd reviewed Jun 4, 2020

pandas/_libs/src/ujson/python/objToJSON.c Outdated

pandas/_libs/src/ujson/python/objToJSON.c Outdated

pandas/tests/io/json/test_ujson.py Outdated

arw2019 and others added 2 commits June 5, 2020 23:31

TST: update pandas/tests/io/json/test_ujson.py

6c2aa9f

Co-authored-by: William Ayd

added Object_getBigNumStringValue to pyEncoder

1a8051f

WillAyd reviewed Jun 6, 2020

pandas/_libs/src/ujson/python/objToJSON.c Outdated

added skeletal code for Object_GetBigNumStringValue

552194e

WillAyd reviewed Jun 7, 2020

pandas/_libs/src/ujson/python/objToJSON.c Outdated

added negative nigNum to test_to_json_large_numers

e4df0f8

WillAyd reviewed Jun 19, 2020

pandas/tests/io/json/test_ujson.py Outdated

WillAyd reviewed Jun 19, 2020

pandas/tests/io/json/test_pandas.py Outdated

arw2019 and others added 2 commits June 19, 2020 19:07

Update pandas/tests/io/json/test_ujson.py

7b041fe

Co-authored-by: William Ayd

fixe test_to_json_for_large_nums for -ve

21d9e98

merge with master

6fd15c9

jreback requested changes Jun 20, 2020

pandas/tests/io/json/test_pandas.py Outdated

pandas/tests/io/json/test_ujson.py Outdated

arw2019 added 5 commits June 21, 2020 06:46

merged with master

c7acef1

TST: added xfail for ujson.encode with long int input

2d43001

TST: fixed variable names in test_to_json_large_numbers

6053227

TST: added xfail test for json.decode Series with long int

a688468

TST: added xfail test for json.decode DataFrame with long int

9e1b95f

arw2019 added 2 commits June 22, 2020 08:15

merge with master

cc0dd6a

BENCH: added benchmarks for long ints

4e53974

arw2019 commented Jun 22, 2020

Merge branch 'master' into json-Overflow-long-int

5c96ae4

jreback approved these changes Jun 24, 2020

jreback added this to the 1.1 milestone Jun 24, 2020

jreback added the Bug label Jun 24, 2020

WillAyd merged commit d85b93d into pandas-dev:master Jun 24, 2020

arw2019 mentioned this pull request Jun 25, 2020

BUG: json.decode fails for nums larger than sys.maxsize #34984

Closed

6 tasks

arw2019 deleted the json-Overflow-long-int branch June 26, 2020 17:51

TomAugspurger mentioned this pull request Jul 6, 2020

CI: MacPython failing TestPandasContainer.test_to_json_large_numbers #35147

Closed
