fix to_json for numbers larger than sys.maxsize #34473

Merged
merged 86 commits into from
Jun 24, 2020
Changes from 18 commits
Commits
95c20db
BUG: overflow on to_json with numbers larger than sys.maxsize
arw2019 May 29, 2020
6d2f8bd
TST: overflow on to_json with numbers larger than sys.maxsize (#34395)
arw2019 May 30, 2020
4fc5b87
DOC: update with issue #34395
arw2019 May 30, 2020
abfca37
TST: removed unused import
arw2019 Jun 3, 2020
7e63941
ENH: added case JT_BIGNUM to encode
arw2019 Jun 3, 2020
3353420
ENH: added JT_BIGNUM to JSTYPES
arw2019 Jun 3, 2020
c9574b8
BUG: changed error for ints>sys.maxsize into JT_BIGNUM
arw2019 Jun 3, 2020
94c112f
ENH: removed debug statements
arw2019 Jun 3, 2020
76576b8
BUG: removed dumps wrapper
arw2019 Jun 3, 2020
9f211a5
removed bigNum from TypeContext
arw2019 Jun 3, 2020
2b7a271
TST: fixed bug in the test
arw2019 Jun 3, 2020
5e06109
added pointer to string rep converter for BigNum
arw2019 Jun 3, 2020
755ef47
TST: removed ujson.loads from the test
arw2019 Jun 3, 2020
0e768f8
added getBigNumStringValue
arw2019 Jun 4, 2020
12d73b0
added code to JT_BIGNUM handler by analogy with JT_UTF8
arw2019 Jun 4, 2020
6c2aa9f
TST: update pandas/tests/io/json/test_ujson.py
arw2019 Jun 6, 2020
1a8051f
added Object_getBigNumStringValue to pyEncoder
arw2019 Jun 6, 2020
552194e
added skeletal code for Object_GetBigNumStringValue
arw2019 Jun 6, 2020
e2898ef
completed Object_getBigNumStringValue using PyObject_Repr
arw2019 Jun 7, 2020
2943995
BUG: changed Object_getBigNumStringValue
arw2019 Jun 7, 2020
771ec5d
improved Object_getBigNumStringValue some more
arw2019 Jun 10, 2020
92bc6ef
update getBigNumStringValue argument
arw2019 Jun 10, 2020
8f3af8c
corrected Object_getBigNumStringValue
arw2019 Jun 10, 2020
cdae92e
more fixes to Object_getBigNumStringValue
arw2019 Jun 10, 2020
1009168
Update pandas/_libs/src/ujson/python/objToJSON.c
arw2019 Jun 10, 2020
759ad8a
Update pandas/_libs/src/ujson/python/objToJSON.c
arw2019 Jun 10, 2020
5e01ed0
Update pandas/_libs/src/ujson/python/objToJSON.c
arw2019 Jun 10, 2020
8a08a38
Update pandas/_libs/src/ujson/python/objToJSON.c
arw2019 Jun 10, 2020
0441fe7
Update pandas/_libs/src/ujson/python/objToJSON.c
arw2019 Jun 10, 2020
4630c0d
Update pandas/_libs/src/ujson/python/objToJSON.c
arw2019 Jun 10, 2020
2e06a8b
Update pandas/_libs/src/ujson/python/objToJSON.c
arw2019 Jun 10, 2020
63056fc
Update pandas/_libs/src/ujson/python/objToJSON.c
arw2019 Jun 10, 2020
6ec960e
updated pyEncoder for JT_BIGNUM
arw2019 Jun 10, 2020
c63a5c9
updated pyEncoder
arw2019 Jun 17, 2020
b2f8f46
moved getBigNumStringValue to pyEncoder
arw2019 Jun 17, 2020
fea9348
fixed declaration of Object_getBigNumStringValue
arw2019 Jun 18, 2020
6516078
fixed Object_getBigNumStringValue
arw2019 Jun 18, 2020
aa2dbca
catch overflow error with PyLong_AsLongLongAndOverflow
arw2019 Jun 18, 2020
7eaf42d
remove unnecessary error check
arw2019 Jun 18, 2020
56d5bac
added shortcircuit for error check
arw2019 Jun 18, 2020
1cdb1ba
simplify int overflow error catching
arw2019 Jun 18, 2020
821d51f
Update long int test in pandas/tests/io/json/test_ujson.py
arw2019 Jun 18, 2020
1001ac1
removed tests expecting numeric overflow
arw2019 Jun 18, 2020
b8f16b6
remove underscore from overflow
arw2019 Jun 18, 2020
a6e83c7
removed underscores from _overflow everywhere
arw2019 Jun 18, 2020
ccc5b47
fixed small typo
arw2019 Jun 18, 2020
585b985
fix type of exc
arw2019 Jun 18, 2020
7586698
deleted numeric overflow tests
arw2019 Jun 18, 2020
0e6768f
remove extraneous condition in if statement
arw2019 Jun 18, 2020
7c19bd2
remove extraneous condition in if statement
arw2019 Jun 18, 2020
9809d7c
change _Bool into int
arw2019 Jun 18, 2020
2739f3d
Update pandas/_libs/src/ujson/python/objToJSON.c
arw2019 Jun 18, 2020
77d69b7
Update pandas/_libs/src/ujson/lib/ultrajsonenc.c
arw2019 Jun 18, 2020
f003d6b
allocate an extra byte in Object_getBigNumStringValue
arw2019 Jun 19, 2020
ee505c9
allocate an extra byte in Object_getBigNumStringValue
arw2019 Jun 19, 2020
cdc0870
reinstate RESERVE_STRING(szlen) in JT_BIGNUM case
arw2019 Jun 19, 2020
0fba3d5
replaced (private) with (public) in whatnew
arw2019 Jun 19, 2020
259018d
release bytes in Object_endTypeContext
arw2019 Jun 19, 2020
a856a41
in JT_BIGNUM change if+if into if+else if
arw2019 Jun 19, 2020
1bbfdc2
added reallocation of bigNum_bytes
arw2019 Jun 19, 2020
665b146
removed bigNum_bytes
arw2019 Jun 19, 2020
3608297
added to_json test for ints>sys.maxsize
arw2019 Jun 19, 2020
4ab13d6
Merge branch 'master' into json-Overflow-long-int
arw2019 Jun 19, 2020
176f212
Use python malloc to match PyObject_Free in endTypeContext
arw2019 Jun 19, 2020
9b58758
TST: added manually constructed strs to compare encodings
arw2019 Jun 19, 2020
44b79f1
resolve conflicts with master
arw2019 Jun 19, 2020
9cbf596
fixed styling to minimize diff with master
arw2019 Jun 19, 2020
7ee21eb
fixed styling
arw2019 Jun 19, 2020
948170c
pandas/_libs/src/ujson/python/objToJSON.c
arw2019 Jun 19, 2020
ff2e25e
fixed conflicts with master
arw2019 Jun 19, 2020
ce37048
fix styling to minimize diff
arw2019 Jun 19, 2020
2db12c0
fix styling to minimize diff
arw2019 Jun 19, 2020
3e820ac
fixed styling
arw2019 Jun 19, 2020
7afeadb
added negative nigNum to test_to_json_large_numers
arw2019 Jun 19, 2020
e4df0f8
added negative nigNum to test_to_json_large_numers
arw2019 Jun 19, 2020
7b041fe
Update pandas/tests/io/json/test_ujson.py
arw2019 Jun 19, 2020
21d9e98
fixe test_to_json_for_large_nums for -ve
arw2019 Jun 19, 2020
6fd15c9
merge with master
arw2019 Jun 20, 2020
c7acef1
merged with master
arw2019 Jun 21, 2020
2d43001
TST: added xfail for ujson.encode with long int input
arw2019 Jun 22, 2020
6053227
TST: fixed variable names in test_to_json_large_numbers
arw2019 Jun 22, 2020
a688468
TST: added xfail test for json.decode Series with long int
arw2019 Jun 22, 2020
9e1b95f
TST: added xfail test for json.decode DataFrame with long int
arw2019 Jun 22, 2020
cc0dd6a
merge with master
arw2019 Jun 22, 2020
4e53974
BENCH: added benchmarks for long ints
arw2019 Jun 22, 2020
5c96ae4
Merge branch 'master' into json-Overflow-long-int
arw2019 Jun 24, 2020
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.1.0.rst
@@ -901,6 +901,7 @@ I/O
- Bug in :meth:`~DataFrame.read_feather` was raising an `ArrowIOError` when reading an s3 or http file path (:issue:`29055`)
- Bug in :meth:`~DataFrame.to_excel` could not handle the column name `render` and was raising an ``KeyError`` (:issue:`34331`)
- Bug in :meth:`~SQLDatabase.execute` was raising a ``ProgrammingError`` for some DB-API drivers when the SQL statement contained the `%` character and no parameters were present (:issue:`34211`)
- Bug in :meth:`json.dumps` was raising an ``OverflowError`` with numbers larger than ``sys.maxsize`` (:issue:`34395`)

Plotting
^^^^^^^^
2 changes: 2 additions & 0 deletions pandas/_libs/src/ujson/lib/ultrajson.h
@@ -150,6 +150,7 @@ enum JSTYPES {
    JT_INT,     // (JSINT32 (signed 32-bit))
    JT_LONG,    // (JSINT64 (signed 64-bit))
    JT_DOUBLE,  // (double)
    JT_BIGNUM,  // integer larger than sys.maxsize
    JT_UTF8,    // (char 8-bit)
    JT_ARRAY,   // Array structure
    JT_OBJECT,  // Key/Value structure
@@ -187,6 +188,7 @@ typedef struct __JSONObjectEncoder {
    JSINT64 (*getLongValue)(JSOBJ obj, JSONTypeContext *tc);
    JSINT32 (*getIntValue)(JSOBJ obj, JSONTypeContext *tc);
    double (*getDoubleValue)(JSOBJ obj, JSONTypeContext *tc);
    const char *(*getBigNumStringValue)(JSOBJ obj, JSONTypeContext *tc);

    /*
    Begin iteration of an iteratable object (JS_ARRAY or JS_OBJECT)
21 changes: 21 additions & 0 deletions pandas/_libs/src/ujson/lib/ultrajsonenc.c
@@ -1107,6 +1107,27 @@ void encode(JSOBJ obj, JSONObjectEncoder *enc, const char *name,
            Buffer_AppendCharUnchecked(enc, '\"');
            break;
        }

        case JT_BIGNUM: {

            value = enc->getBigNumStringValue(obj, &tc);

            Buffer_Reserve(enc, RESERVE_STRING(szlen));
            if (enc->errorMsg) {
                enc->endTypeContext(obj, &tc);
                return;
            }

            if (!Buffer_EscapeStringValidated(obj, enc, value,
                                              value + szlen)) {
                enc->endTypeContext(obj, &tc);
                enc->level--;
                return;
            }

            break;

        }
    }

    enc->endTypeContext(obj, &tc);
41 changes: 40 additions & 1 deletion pandas/_libs/src/ujson/python/objToJSON.c
@@ -1635,7 +1635,10 @@ void Object_beginTypeContext(JSOBJ _obj, JSONTypeContext *tc) {

        if (exc && PyErr_ExceptionMatches(PyExc_OverflowError)) {
            PRINTMARK();
            goto INVALID;
            tc->type = JT_BIGNUM;

            // This line generates compiler errors.
            GET_TC(tc)->cStr = Object_getBigNumStringValue(obj);
        }

        return;
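
The hunk above reroutes integers that overflow the native conversion to the new `JT_BIGNUM` type instead of treating them as invalid. As a rough illustration only, that dispatch can be sketched in Python. The bounds below assume a signed 64-bit range, and `classify_int` is a hypothetical helper, not part of pandas or ujson; the C code reacts to an `OverflowError` from the conversion rather than comparing against bounds.

```python
from typing import Tuple, Union

# Assumption: the native integer path is a signed 64-bit value (JSINT64).
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def classify_int(value: int) -> Tuple[str, Union[int, str]]:
    """Pick an encoder path for an int (hypothetical helper for illustration)."""
    if INT64_MIN <= value <= INT64_MAX:
        # Fits in the native type: keep the numeric path.
        return ("JT_LONG", value)
    # Too large in magnitude: carry the decimal text instead of a number.
    return ("JT_BIGNUM", str(value))
```

Under these assumptions `classify_int(2**63)` takes the string path while `classify_int(2**63 - 1)` stays numeric; the same split applies on the negative side.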
@@ -2126,6 +2129,41 @@ double Object_getDoubleValue(JSOBJ Py_UNUSED(obj), JSONTypeContext *tc) {
    return GET_TC(tc)->doubleValue;
}

// /*
const char *Object_getBigNumStringValue(JSOBJ obj, JSONTypeContext *tc,
                                        size_t *_outLen) {
    /* here goes the code that converts obj into a string
    char *wstr;
    if (obj<0) {wstr++ = "-";}
    int digit;
    PyObject rem;
    PyObject ten;
    long ten_as_long = 10, rem_as_long;
    ten = PyNumber_FromLong(ten_as_long);
    do {
        rem = PyNumber_Remainder(obj, ten);
        obj = PyNumber_FloorDivide(obj, ten);

        rem_as_long = PyLong_AsLong(rem);
        wstr++ = char(48 + (int) rem_as_long);
    } while (obj>10);
    */

    // we then load that string into tc->cStr
    GET_TC(tc)->str = wstr;

    /* _outLen: do we set that here?
       I can imagine counting the number of digits in the
       do-while loop and then setting
       _outLen = (number of digits);
       I'm not quite sure how that would work though
       since _outLen is an argument to this function
    */

    return GET_TC(tc)->cStr;
}
// */
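
The commented-out loop above peels off decimal digits with remainder and floor division. A minimal Python sketch of that idea follows; it is not the code the PR settled on (a later commit in the list mentions `PyObject_Repr`), and `int_to_decimal_string` is a hypothetical name. Two details matter for correctness: the digits come out least significant first, so they have to be reversed, and the loop has to run until the quotient reaches zero rather than stopping at 10.

```python
def int_to_decimal_string(value: int) -> str:
    """Build the base-10 text of an arbitrary-size int one digit at a time."""
    if value == 0:
        return "0"

    negative = value < 0
    remaining = -value if negative else value

    digits = []
    while remaining > 0:
        # divmod gives the quotient and the last decimal digit in one step.
        remaining, digit = divmod(remaining, 10)
        digits.append(chr(ord("0") + digit))  # least significant digit first

    if negative:
        digits.append("-")

    # Digits were collected in reverse order, so flip them at the end.
    return "".join(reversed(digits))
```

The length of the returned string is the digit count (plus one for a sign), which is the quantity the `_outLen` question in the comment is asking about.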

static void Object_releaseObject(JSOBJ _obj) { Py_DECREF((PyObject *)_obj); }

void Object_iterBegin(JSOBJ obj, JSONTypeContext *tc) {
@@ -2179,6 +2217,7 @@ PyObject *objToJSON(PyObject *Py_UNUSED(self), PyObject *args,
        Object_endTypeContext,
        Object_getStringValue,
        Object_getLongValue,
        Object_getBigNumStringValue,
Member Author

This line throws a compiler error:

pandas/_libs/src/ujson/python/objToJSON.c:2194:9: error: initialization of ‘JSINT32 (*)(void *, JSONTypeContext *)’ {aka ‘int (*)(void *, struct __JSONTypeContext *)’} from incompatible pointer type ‘const char * (*)(void *, JSONTypeContext *, size_t *)’ {aka ‘const char * (*)(void *, struct __JSONTypeContext *, long unsigned int *)’} [-Werror=incompatible-pointer-types]

Member

You will need to update the PyEncoder struct to have the appropriate member definitions

Member

If you can address this and the comment around the free call, I think things will work, though leaking some memory; we can continue to address that afterwards.

        NULL,  // getIntValue is unused
        Object_getDoubleValue,
        Object_iterBegin,
1 change: 1 addition & 0 deletions pandas/io/json/_json.py
@@ -27,6 +27,7 @@
loads = json.loads
dumps = json.dumps


TABLE_SCHEMA_VERSION = "0.20.0"


10 changes: 10 additions & 0 deletions pandas/tests/io/json/test_ujson.py
@@ -5,6 +5,7 @@
import locale
import math
import re
import sys
import time

import dateutil
@@ -559,6 +560,15 @@ def test_encode_long_conversion(self):
        assert output == json.dumps(long_input)
        assert long_input == ujson.decode(output)

    def test_dumps_ints_larger_than_maxsize(self):
Contributor

Can you also add a full integration test in json/test_pandas.py?

Member

Should also parametrize this to exceed the minimum supported native size

Member

Can you do the parametrization of a large negative value here too?

        # GH34395
        big_num = sys.maxsize + 1
        encoding = ujson.dumps(big_num)

        assert str(encoding) == json.dumps(big_num)
        # ujson.loads to be fixed in the future
        # assert ujson.loads(encoding) == big_num
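
The review comments on this test ask for coverage on both sides of the native range. One way that check might look is sketched below, using only the standard library so it runs on its own. `OUT_OF_RANGE_INTS` and `check_big_int_encoding` are hypothetical names, and the encoder under test is passed in as a callable rather than imported.

```python
import json
import sys

# One value just above the native maximum and one just below the native minimum.
OUT_OF_RANGE_INTS = (sys.maxsize + 1, -(sys.maxsize + 2))


def check_big_int_encoding(dumps, values=OUT_OF_RANGE_INTS):
    """Return the values whose encoding differs from the stdlib json.dumps output."""
    return [v for v in values if str(dumps(v)) != json.dumps(v)]
```

In the PR's test module the callable would presumably be `ujson.dumps`, and an empty list would mean both the positive and the negative case encode as expected.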

    @pytest.mark.parametrize(
        "int_exp", ["1337E40", "1.337E40", "1337E+9", "1.337e+40", "1.337E-4"]
    )