
Commit 24e98df

tobgu authored and AnkurDedania committed
BUG: Fix pandas-dev#15344 by backporting ujson usage of PEP 393 API
Make use of the PEP 393 API to avoid expanding single byte ascii characters into four byte unicode characters when encoding objects to json.

closes pandas-dev#15344

Author: Tobias Gustafsson <[email protected]>

Closes pandas-dev#15360 from tobgu/backport-ujson-compact-ascii-encoding and squashes the following commits:

44de133 [Tobias Gustafsson] Fix C-code formatting to pass linting of GH15344
b7e404f [Tobias Gustafsson] Merge branch 'master' into backport-ujson-compact-ascii-encoding
4e8e2ff [Tobias Gustafsson] BUG: Fix pandas-dev#15344 by backporting ujson usage of PEP 393 APIs for compact ascii
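The storage difference the commit message refers to can be observed from Python. The snippet below is a minimal sketch, assuming a CPython 3.3+ interpreter where `sys.getsizeof` reflects the PEP 393 string layout; `bytes_per_char` is a hypothetical helper written for this illustration and is not part of pandas or ujson.

```python
import sys


def bytes_per_char(ch: str, n: int = 100) -> int:
    """Marginal storage cost of one extra copy of `ch` in a str.

    Assumption: CPython with PEP 393 flexible string storage, where
    sys.getsizeof accounts for the character buffer.
    """
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)


ascii_cost = bytes_per_char("a")             # compact ASCII storage
latin1_cost = bytes_per_char("\xe9")         # Latin-1 range, one byte wide
bmp_cost = bytes_per_char("\u20ac")          # two bytes per code point
astral_cost = bytes_per_char("\U0001F600")   # four bytes per code point

print(ascii_cost, latin1_cost, bmp_cost, astral_cost)
```

On such an interpreter the ASCII case should cost one byte per character and the astral case four, which is the gap between the compact form and the four byte form mentioned in the message.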
1 parent 18bc171 commit 24e98df

3 files changed: +24 -1 lines changed


doc/source/whatsnew/v0.20.0.txt (+4 -1)

@@ -538,6 +538,8 @@ Bug Fixes
 - Bug in ``pd.pivot_table()`` where no error was raised when values argument was not in the columns (:issue:`14938`)

 - Bug in ``.to_json()`` where ``lines=True`` and contents (keys or values) contain escaped characters (:issue:`15096`)
+- Bug in ``.to_json()`` causing single byte ascii characters to be expanded to four byte unicode (:issue:`15344`)
+- Bug in ``.read_json()`` for Python 2 where ``lines=True`` and contents contain non-ascii unicode characters (:issue:`15132`)
 - Bug in ``.rolling/expanding()`` functions where ``count()`` was not counting ``np.Inf``, nor handling ``object`` dtypes (:issue:`12541`)
 - Bug in ``DataFrame.resample().median()`` if duplicate column names are present (:issue:`14233`)

@@ -561,7 +563,6 @@ Bug Fixes
 - Bug in ``DataFrame.fillna()`` where the argument ``downcast`` was ignored when fillna value was of type ``dict`` (:issue:`15277`)


-- Bug in ``.read_json()`` for Python 2 where ``lines=True`` and contents contain non-ascii unicode characters (:issue:`15132`)

 - Bug in ``pd.read_csv()`` with ``float_precision='round_trip'`` which caused a segfault when a text entry is parsed (:issue:`15140`)

@@ -574,4 +575,6 @@ Bug Fixes

 - Bug in ``DataFrame.boxplot`` where ``fontsize`` was not applied to the tick labels on both axes (:issue:`15108`)
 - Bug in ``Series.replace`` and ``DataFrame.replace`` which failed on empty replacement dicts (:issue:`15289`)
+
+
 - Bug in ``.eval()`` which caused multiline evals to fail with local variables not on the first line (:issue:`15342`)

pandas/io/tests/json/test_pandas.py (+10)

@@ -1044,3 +1044,13 @@ def roundtrip(s, encoding='latin-1'):

         for s in examples:
             roundtrip(s)
+
+    def test_data_frame_size_after_to_json(self):
+        # GH15344
+        df = DataFrame({'a': [str(1)]})
+
+        size_before = df.memory_usage(index=True, deep=True).sum()
+        df.to_json()
+        size_after = df.memory_usage(index=True, deep=True).sum()
+
+        self.assertEqual(size_before, size_after)
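The invariant this test checks (serializing must not change the memory footprint of the source string) can be sketched without pandas. This is an illustration under stated assumptions: the standard library `json` module stands in for pandas' bundled ujson encoder, `sys.getsizeof` is a rough stand-in for `memory_usage(deep=True)`, and `size_unchanged_by` is a hypothetical helper name.

```python
import json
import sys


def size_unchanged_by(serialize, value: str) -> bool:
    """Return True if calling serialize(value) leaves the reported
    size of the str object unchanged."""
    before = sys.getsizeof(value)
    serialize(value)  # the result is discarded; only side effects matter here
    after = sys.getsizeof(value)
    return before == after


# Build the string at run time, as the test does with str(1).
sample = str(1)
unchanged = size_unchanged_by(json.dumps, sample)
print(unchanged)
```

Before this fix, the equivalent check against `DataFrame.to_json()` failed because encoding grew the string objects held in the frame.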

pandas/src/ujson/python/objToJSON.c (+10)

@@ -402,6 +402,16 @@ static void *PyStringToUTF8(JSOBJ _obj, JSONTypeContext *tc, void *outValue,
 static void *PyUnicodeToUTF8(JSOBJ _obj, JSONTypeContext *tc, void *outValue,
                              size_t *_outLen) {
     PyObject *obj = (PyObject *)_obj;
+
+#if (PY_VERSION_HEX >= 0x03030000)
+    if (PyUnicode_IS_COMPACT_ASCII(obj)) {
+        Py_ssize_t len;
+        char *data = PyUnicode_AsUTF8AndSize(obj, &len);
+        *_outLen = len;
+        return data;
+    }
+#endif
+
     PyObject *newObj = PyUnicode_EncodeUTF8(PyUnicode_AS_UNICODE(obj),
                                             PyUnicode_GET_SIZE(obj), NULL);
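The fast path relies on ASCII text already being valid UTF-8, byte for byte, so no transcoding is needed. The Python sketch below is only an analogy for that branch: it tests string content with `str.isascii()` (Python 3.7+), whereas the C code tests the storage form with `PyUnicode_IS_COMPACT_ASCII` and returns a pointer into the existing buffer rather than a copy. `utf8_view` is a hypothetical name used for illustration.

```python
def utf8_view(s: str) -> bytes:
    """Python-level model of the two branches in PyUnicodeToUTF8."""
    if s.isascii():
        # Code points 0..127 map to the same single bytes in UTF-8,
        # so the ASCII bytes can be used as the UTF-8 output directly.
        return s.encode("ascii")
    # General path: full UTF-8 encoding of wider code points.
    return s.encode("utf-8")


ascii_out = utf8_view("abc")
wide_out = utf8_view("\u20ac")  # EURO SIGN needs three bytes in UTF-8
print(len(ascii_out), len(wide_out))
```

For ASCII input the output length equals the character count, which is why the C branch can report `len` from `PyUnicode_AsUTF8AndSize` as `_outLen` without further work.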