fix to_json for numbers larger than sys.maxsize #34473
Conversation
This needs to be handled in the ujson implementation.

If we do it in ujson, it would be around here: pandas/_libs/src/ujson/python/objToJSON.c, line 1636 (at 1cad9e5).

So instead of raising, I think we would convert to a buffer or string object and write that out. We would probably need to do something else on top of that so that quotes are not added to the output.
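To make the goal concrete, here is a small Python illustration (not from the PR itself; exact pre-fix error behaviour may vary by build): the stdlib json module already serializes integers wider than 64 bits, and the aim is for the ujson-based to_json to produce the same unquoted digits instead of overflowing.

```python
import json
import sys

import pandas as pd

big = sys.maxsize + 1                            # 2**63 on typical 64-bit builds
print(json.dumps(big))                           # -> 9223372036854775808
print(pd.Series([big], dtype=object).to_json())
# On builds without this fix the pandas call fails on the oversized integer;
# with it, the output should carry the same unquoted digits, e.g.
# {"0":9223372036854775808}
```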
@WillAyd Thanks for this! I'm some of the way along the solution. Inside the if statement that starts at pandas/_libs/src/ujson/python/objToJSON.c, line 1636 (at 1cad9e5), I convert obj into a string. This then needs to be passed along for encoding.

Scrolling down to pandas/_libs/src/ujson/python/objToJSON.c, line 2285 (at 1cad9e5), the objToJSON function has created an encoded string using the encoder and buffer objects. I'm not quite clear on how those work here. If I'm understanding correctly, the encoder is a struct with information about formatting and the buffer is where the output is actually stored. In that case, I would want to figure out how to set the parameters of the encoder and how to dump my string into the buffer?

Thanks so much for the help with this!
You typically don't need to touch the buffer. At a high level, the way it works is that you set the type of the object being serialized, and there is a context associated with that type which the serializer uses to decide whether to continue recursing (i.e. to keep nesting, say, a dictionary) or to simply write the object. You can take any example, but if you look at the float code at pandas/_libs/src/ujson/python/objToJSON.c, line 1642 (at d5e8edc), we are introspecting the object and setting the appropriate type, which is then picked up by the encoder in pandas/_libs/src/ujson/lib/ultrajsonenc.c (line 1072 at d5e8edc).

So for this, we would need to figure out a way to write out a numeric value that exceeds C's storage limits for integers. One possible way is to store the Python str representation of the value in the structure referenced at pandas/_libs/src/ujson/lib/ultrajson.h, line 146 (at d5e8edc).
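A rough Python sketch of the flow described above (purely illustrative; the real implementation is C in objToJSON.c and ultrajsonenc.c, and JT_LONG / JT_BIGNUM here are stand-ins for the encoder's type tags, with the callback playing the role of the type context):

```python
import sys

JT_LONG, JT_BIGNUM = "JT_LONG", "JT_BIGNUM"   # hypothetical stand-ins for the C enum values

def begin_type_context(obj):
    """Introspect the object and return (type tag, value callback)."""
    if isinstance(obj, int):
        if -(2 ** 63) <= obj < 2 ** 63:
            return JT_LONG, lambda: obj        # fits in a C long long
        return JT_BIGNUM, lambda: str(obj)     # too wide: keep the digits as text
    raise TypeError("only ints handled in this sketch")

def encode(obj):
    jtype, get_value = begin_type_context(obj)
    if jtype == JT_LONG:
        return str(get_value())
    # JT_BIGNUM: write the stored string verbatim, with no surrounding quotes,
    # so the output is still a JSON number token rather than a string
    return get_value()

print(encode(42))                  # -> 42
print(encode(sys.maxsize + 1))     # -> 9223372036854775808
```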
Can you add a test for this? It will help guide the implementation and feedback.
I added one in pandas/tests/io/json/test_ujson.py:

    big_num = sys.maxsize + 1
    encoding = ujson.dumps(big_num)
    assert encoding == json.dumps(big_num)
    assert ujson.loads(encoding) == big_num
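For reference, a hedged sketch of what a pandas-level counterpart could look like, modeled on the test_pandas.py lines quoted later in the thread and assuming a build that includes this fix (the actual test in the PR may assert more cases):

```python
import sys

from pandas import Series

bigNum = sys.maxsize + 1
series = Series(bigNum, dtype=object, index=["articleId"])
# to_json should emit the full digits, unquoted, for the object-dtype value
assert series.to_json() == f'{{"articleId":{bigNum}}}'
```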
So the CI build is failing, with this message:

    1.07s$ ci/run_tests.sh
    xvfb-run pytest -m "(not slow and not network and not clipboard)" -n auto --dist=loadfile -s --strict --durations=30 --junitxml=test-data.xml pandas
    ImportError while loading conftest '/home/travis/build/pandas-dev/pandas/pandas/conftest.py'.
    pandas/__init__.py:34: in <module>
        raise ImportError(
    E   ImportError: C extension: 'dtypes' from partially initialized module 'pandas._libs.tslibs' (most likely due to a circular import) (/home/travis/build/pandas-dev/pandas/pandas/_libs/tslibs/__init__.py) not built. If you want to import pandas from the source directory, you may need to run 'python setup.py build_ext --inplace --force' to build the C extensions first.
    The command "ci/run_tests.sh" exited with 4.

I feel like the problem could be these lines in pandas/tests/io/json/test_pandas.py:

    originalSeries = Series(bigNum, dtype=object, index=["articleId"])
    originalDataFrame = DataFrame(
        bigNum, dtype=object, index=["articleId"], columns=[0]
    )

since Travis is complaining about the dtypes extension...?
Can you merge master and repush? I don't think that error is related to the changes made here.
Can you run the JSON perf tests to see if anything is affected? You may need to add some long ints there as well.
@jreback Benchmark results are attached; a number of the benchmarks changed significantly. I'm working on adding long int benchmarks.
@@ -82,6 +84,7 @@ def setup(self, orient, frame):
     timedeltas = timedelta_range(start=1, periods=N, freq="s")
     datetimes = date_range(start=1, periods=N, freq="s")
     ints = np.random.randint(100000000, size=N)
+    longints = sys.maxsize * np.random.randint(100000000, size=N)
@jreback Not sure where we want the longints to go - whether into ints or separately. I would think we also want to make sure we have both positive and negative long ints in there?
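One possible way to do that (an illustration, not necessarily what was merged) is to build the values on an object-dtype array, so the products stay genuine Python ints instead of wrapping around in int64:

```python
import sys

import numpy as np

N = 10 ** 5
# object dtype keeps the elementwise products as arbitrary-precision Python ints
base = np.random.randint(100_000_000, size=N).astype(object)
longints = sys.maxsize * base      # large positive values beyond the int64 range
neg_longints = -longints           # and their negative counterparts
```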
This is probably fine.
Looks OK to me. @WillAyd, merge when you are happy.
Thanks @arw2019, very nice PR!
* BUG: overflow on to_json with numbers larger than sys.maxsize
* TST: overflow on to_json with numbers larger than sys.maxsize (pandas-dev#34395)
* DOC: update with issue pandas-dev#34395
* TST: removed unused import
* ENH: added case JT_BIGNUM to encode
* ENH: added JT_BIGNUM to JSTYPES
* BUG: changed error for ints>sys.maxsize into JT_BIGNUM
* ENH: removed debug statements
* BUG: removed dumps wrapper
* removed bigNum from TypeContext
* TST: fixed bug in the test
* added pointer to string rep converter for BigNum
* TST: removed ujson.loads from the test
* added getBigNumStringValue
* added code to JT_BIGNUM handler by analogy with JT_UTF8
* TST: update pandas/tests/io/json/test_ujson.py Co-authored-by: William Ayd <[email protected]>
* added Object_getBigNumStringValue to pyEncoder
* added skeletal code for Object_GetBigNumStringValue
* completed Object_getBigNumStringValue using PyObject_Repr
* BUG: changed Object_getBigNumStringValue
* improved Object_getBigNumStringValue some more
* update getBigNumStringValue argument
* corrected Object_getBigNumStringValue
* more fixes to Object_getBigNumStringValue
* Update pandas/_libs/src/ujson/python/objToJSON.c Co-authored-by: William Ayd <[email protected]>
* Update pandas/_libs/src/ujson/python/objToJSON.c Co-authored-by: William Ayd <[email protected]>
* Update pandas/_libs/src/ujson/python/objToJSON.c Co-authored-by: William Ayd <[email protected]>
* Update pandas/_libs/src/ujson/python/objToJSON.c Co-authored-by: William Ayd <[email protected]>
* Update pandas/_libs/src/ujson/python/objToJSON.c Co-authored-by: William Ayd <[email protected]>
* Update pandas/_libs/src/ujson/python/objToJSON.c Co-authored-by: William Ayd <[email protected]>
* Update pandas/_libs/src/ujson/python/objToJSON.c
* Update pandas/_libs/src/ujson/python/objToJSON.c
* updated pyEncoder for JT_BIGNUM
* updated pyEncoder
* moved getBigNumStringValue to pyEncoder
* fixed declaration of Object_getBigNumStringValue
* fixed Object_getBigNumStringValue
* catch overflow error with PyLong_AsLongLongAndOverflow
* remove unnecessary error check
* added shortcircuit for error check
* simplify int overflow error catching Co-authored-by: William Ayd <[email protected]>
* Update long int test in pandas/tests/io/json/test_ujson.py Co-authored-by: William Ayd <[email protected]>
* removed tests expecting numeric overflow
* remove underscore from overflow Co-authored-by: William Ayd <[email protected]>
* removed underscores from _overflow everywhere
* fixed small typo
* fix type of exc
* deleted numeric overflow tests
* remove extraneous condition in if statement Co-authored-by: William Ayd <[email protected]>
* remove extraneous condition in if statement Co-authored-by: William Ayd <[email protected]>
* change _Bool into int Co-authored-by: William Ayd <[email protected]>
* Update pandas/_libs/src/ujson/python/objToJSON.c Co-authored-by: William Ayd <[email protected]>
* Update pandas/_libs/src/ujson/lib/ultrajsonenc.c Co-authored-by: William Ayd <[email protected]>
* allocate an extra byte in Object_getBigNumStringValue Co-authored-by: William Ayd <[email protected]>
* allocate an extra byte in Object_getBigNumStringValue Co-authored-by: William Ayd <[email protected]>
* reinstate RESERVE_STRING(szlen) in JT_BIGNUM case
* replaced (private) with (public) in whatnew
* release bytes in Object_endTypeContext
* in JT_BIGNUM change if+if into if+else if
* added reallocation of bigNum_bytes
* removed bigNum_bytes
* added to_json test for ints>sys.maxsize
* Use python malloc to match PyObject_Free in endTypeContext Co-authored-by: William Ayd <[email protected]>
* TST: added manually constructed strs to compare encodings
* fixed styling to minimize diff with master
* fixed styling
* fixed conflicts with master
* fix styling to minimize diff
* fix styling to minimize diff
* fixed styling
* added negative nigNum to test_to_json_large_numers
* added negative nigNum to test_to_json_large_numers
* Update pandas/tests/io/json/test_ujson.py Co-authored-by: William Ayd <[email protected]>
* fixe test_to_json_for_large_nums for -ve
* TST: added xfail for ujson.encode with long int input
* TST: fixed variable names in test_to_json_large_numbers
* TST: added xfail test for json.decode Series with long int
* TST: added xfail test for json.decode DataFrame with long int
* BENCH: added benchmarks for long ints

Co-authored-by: William Ayd <[email protected]>
Currently this patch causes a significant slowdown in a number of the JSON performance benchmarks. A printout of my results is included below.
json_benchmarks_results.txt
I'd love to keep working on this if anybody has ideas for making the solution more efficient!
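For anyone poking at the performance locally, a rough ad-hoc timing check (a sketch only, not the asv benchmark suite used for the attached results) that exercises both the ordinary integer path and the big-number path on a build with this fix:

```python
import sys
import timeit

import pandas as pd

# ordinary int64 column vs. an object column whose values exceed sys.maxsize
small = pd.DataFrame({"x": range(100_000)})
big = pd.DataFrame({"x": [sys.maxsize + i + 1 for i in range(100_000)]}, dtype=object)

print("int64 column :", timeit.timeit(small.to_json, number=10))
print("bignum column:", timeit.timeit(big.to_json, number=10))
```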