BUG: Fix pandas-dev#15344 by backporting ujson usage of PEP 393 API

tobgu · AnkurDedania · commit 24e98dff31ed · 2017-03-21T11:17:09.000-05:00
Make use of the PEP 393 API to avoid expanding single byte ascii characters into four byte unicode characters when encoding objects to json.

closes pandas-dev#15344

Author: Tobias Gustafsson <tobias.l.gustafsson@gmail.com>

Closes pandas-dev#15360 from tobgu/backport-ujson-compact-ascii-encoding and squashes the following commits:

44de133 [Tobias Gustafsson] Fix C-code formatting to pass linting of GH15344
b7e404f [Tobias Gustafsson] Merge branch 'master' into backport-ujson-compact-ascii-encoding
4e8e2ff [Tobias Gustafsson] BUG: Fix pandas-dev#15344 by backporting ujson usage of PEP 393 APIs for compact ascii
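
The regression test in this commit compares the memory footprint of a DataFrame before and after serialization. A rough standalone sketch of the same kind of measurement is shown here, applied to a single string. It assumes CPython and uses the standard library json module as a stand-in serializer, not pandas' bundled ujson; the helper name size_delta_after and the sample value are hypothetical and not part of the patch.

```python
import json
import sys


def size_delta_after(text, action):
    """Return how many bytes sys.getsizeof(text) grew while action(text) ran."""
    before = sys.getsizeof(text)
    action(text)
    return sys.getsizeof(text) - before


# Hypothetical sample value; any pure-ASCII string would do.
sample = "1" * 64

# Stand-in serializer (assumption): stdlib json rather than pandas' ujson.
growth = size_delta_after(sample, lambda s: json.dumps({"a": [s]}))

# A serializer that only reads the string should leave its size unchanged.
print("growth in bytes:", growth)
```

A non-zero result would suggest that the serializer attached an extra cached representation to the string object, which is the kind of side effect the memory_usage comparison in the test is meant to catch.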
diff --git a/doc/source/whatsnew/v0.20.0.txt b/doc/source/whatsnew/v0.20.0.txt
@@ -538,6 +538,8 @@ Bug Fixes
 - Bug in ``pd.pivot_table()`` where no error was raised when values argument was not in the columns (:issue:`14938`)
 
 - Bug in ``.to_json()`` where ``lines=True`` and contents (keys or values) contain escaped characters (:issue:`15096`)
+- Bug in ``.to_json()`` causing single byte ascii characters to be expanded to four byte unicode (:issue:`15344`)
+- Bug in ``.read_json()`` for Python 2 where ``lines=True`` and contents contain non-ascii unicode characters (:issue:`15132`)
 - Bug in ``.rolling/expanding()`` functions where ``count()`` was not counting ``np.Inf``, nor handling ``object`` dtypes (:issue:`12541`)
 - Bug in ``DataFrame.resample().median()`` if duplicate column names are present (:issue:`14233`)
 
@@ -561,7 +563,6 @@ Bug Fixes
 - Bug in ``DataFrame.fillna()`` where the argument ``downcast`` was ignored when fillna value was of type ``dict`` (:issue:`15277`)
 
 
-- Bug in ``.read_json()`` for Python 2 where ``lines=True`` and contents contain non-ascii unicode characters (:issue:`15132`)
 
 - Bug in ``pd.read_csv()`` with ``float_precision='round_trip'`` which caused a segfault when a text entry is parsed (:issue:`15140`)
 
@@ -574,4 +575,6 @@ Bug Fixes
 
 - Bug in ``DataFrame.boxplot`` where ``fontsize`` was not applied to the tick labels on both axes (:issue:`15108`)
 - Bug in ``Series.replace`` and ``DataFrame.replace`` which failed on empty replacement dicts (:issue:`15289`)
+
+
 - Bug in ``.eval()`` which caused multiline evals to fail with local variables not on the first line (:issue:`15342`)
diff --git a/pandas/io/tests/json/test_pandas.py b/pandas/io/tests/json/test_pandas.py
@@ -1044,3 +1044,13 @@ def roundtrip(s, encoding='latin-1'):
 
         for s in examples:
             roundtrip(s)
+
+    def test_data_frame_size_after_to_json(self):
+        # GH15344
+        df = DataFrame({'a': [str(1)]})
+
+        size_before = df.memory_usage(index=True, deep=True).sum()
+        df.to_json()
+        size_after = df.memory_usage(index=True, deep=True).sum()
+
+        self.assertEqual(size_before, size_after)
diff --git a/pandas/src/ujson/python/objToJSON.c b/pandas/src/ujson/python/objToJSON.c
@@ -402,6 +402,16 @@ static void *PyStringToUTF8(JSOBJ _obj, JSONTypeContext *tc, void *outValue,
 static void *PyUnicodeToUTF8(JSOBJ _obj, JSONTypeContext *tc, void *outValue,
                              size_t *_outLen) {
     PyObject *obj = (PyObject *)_obj;
+
+#if (PY_VERSION_HEX >= 0x03030000)
+    if (PyUnicode_IS_COMPACT_ASCII(obj)) {
+        Py_ssize_t len;
+        char *data = PyUnicode_AsUTF8AndSize(obj, &len);
+        *_outLen = len;
+        return data;
+    }
+#endif
+
     PyObject *newObj = PyUnicode_EncodeUTF8(PyUnicode_AS_UNICODE(obj),
                                             PyUnicode_GET_SIZE(obj), NULL);