
Calling DataFrame.to_json() increases data frame memory usage in Python 3.6 #15344


Closed · tobgu opened this issue Feb 8, 2017 · 5 comments

Labels: IO JSON (read_json, to_json, json_normalize), Performance (memory or execution speed)

Comments

tobgu (Contributor) commented Feb 8, 2017

Code Sample, a copy-pastable example if possible

Python 3.6.0 (default, Dec 29 2016, 21:40:24) 
[GCC 4.9.4] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> df = pd.DataFrame({'a': [str(i) for i in range(10000)]})
>>> df.memory_usage(index=True, deep=True)
Index        80
a        608890
dtype: int64
>>> j = df.to_json()
>>> df.memory_usage(index=True, deep=True)
Index        80
a        804450
dtype: int64

Compared to Python 2.7.12

Python 2.7.12 (default, Jul 18 2016, 15:02:52) 
[GCC 4.8.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas as pd
>>> df = pd.DataFrame({'a': [str(i) for i in range(10000)]})
>>> df.memory_usage(index=True, deep=True)
Index        72
a        488890
dtype: int64
>>> j = df.to_json()
>>> df.memory_usage(index=True, deep=True)
Index        72
a        488890
dtype: int64

Problem description

Calling to_json should not have any impact on the reported memory usage of a DataFrame, just as in Python 2. The observed increase above is 32%, which is really high.

This only seems to happen with DataFrames that contain strings.

I've also tested calling to_csv; that does not trigger this behaviour.

Furthermore, memory usage seems to be quite a lot higher in Python 3 than for the equivalent data frame in Python 2 (~25% in the example above). I guess this is more related to strings in Python 2 vs Python 3 than to pandas, though?
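
For reference, a rough sketch of what deep=True is counting here, assuming it adds the per-object size of each string (via something like sys.getsizeof) to the pointer array itself; the exact pandas internals may differ:

import sys
import pandas as pd

df = pd.DataFrame({'a': [str(i) for i in range(10000)]})

# Object columns hold PyObject* pointers (8 bytes each on a 64-bit build);
# the string payloads live on the Python heap, so a deep measurement must
# follow every pointer and add the size of the object it points at.
estimate = df['a'].values.nbytes + sum(sys.getsizeof(s) for s in df['a'])
print(estimate)  # 608890 on 64-bit CPython 3.6, matching the report above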

Expected Output

No change in reported memory usage after calling to_json.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.0.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-83-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.19.2
nose: None
pip: 9.0.1
setuptools: 34.1.0
Cython: None
numpy: 1.12.0
scipy: None
statsmodels: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.2
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None

chris-b1 (Contributor) commented Feb 8, 2017

Partly guessing, but I think https://www.python.org/dev/peps/pep-0393/ could be the culprit. Python 3 can choose a compact string representation, and the C API calls the JSON code is using could then force it into a less compact representation.

https://docs.python.org/3/whatsnew/3.3.html#pep-393-flexible-string-representation
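
For a sense of what the flexible representation means in practice, here is a minimal sketch; the sizes are from 64-bit CPython 3.6 and may vary by build:

import sys

# CPython picks the narrowest layout that can hold the string:
print(sys.getsizeof('1'))           # 50 - compact ASCII, 1 byte per char
print(sys.getsizeof('\xe9'))        # 74 - latin-1, still 1 byte per char
print(sys.getsizeof('\u20ac'))      # 76 - UCS-2, 2 bytes per char
print(sys.getsizeof('\U0001f600'))  # 80 - UCS-4, 4 bytes per char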

chris-b1 added the 2/3 Compat, IO JSON, and Performance labels Feb 8, 2017
jreback (Contributor) commented Feb 8, 2017

The absolute sizes of memory used are dependent on py2/py3, as @chris-b1 indicates.

So strings are held in Python space, backed by pointers from numpy. This IS using the C API; I would hazard a guess that strings that were formerly interned now no longer are. This is pretty deep.

If anyone wants to investigate please do so. Though I would say this is out of pandas hands.
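
For background on the interning guess, a small sketch of CPython's behaviour; interning heuristics are implementation details and may differ between versions:

import sys

a = 'foo'
b = 'foo'
print(a is b)               # True: identifier-like literals are interned

c = ''.join(['fo', 'o'])    # built at runtime, so not interned
print(a is c)               # False: equal value, distinct object
print(sys.intern(c) is a)   # True: explicit interning returns the shared copy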

jreback added this to the Someday milestone Feb 8, 2017
chris-b1 (Contributor) commented Feb 8, 2017

It does look like ujson was updated to use the new C API, so if you wanted to try porting this back in, it may not be too hard.

ours -

static void *PyUnicodeToUTF8(JSOBJ _obj, JSONTypeContext *tc, void *outValue,

theirs - https://github.com/esnme/ultrajson/blob/2f1d4874f4f4d2a40a460678004c80e69387c663/python/objToJSON.c#L143

tobgu (Contributor, Author) commented Feb 8, 2017

Yes, there seem to be quite a few updates in ujson since the version that is present in pandas. It does not seem to me like it has to do with interning, but rather that single-byte ASCII characters are turned into full-blown 4-byte unicode characters, based on the basic experiment below.

Having a data frame dominated by long strings would hence mean that its memory usage roughly quadruples after calling to_json.

>>> df = pd.DataFrame({'a': [str(1)]})
>>> df.memory_usage(index=True, deep=True)
Index    80
a        58
dtype: int64
>>> df.to_json()
'{"a":{"0":"1"}}'
>>> df.memory_usage(index=True, deep=True)
Index    80
a        66
dtype: int64
>>> df = pd.DataFrame({'a': [str(11)]})
>>> df.to_json()
'{"a":{"0":"11"}}'
>>> df = pd.DataFrame({'a': [str(11)]})
>>> df.memory_usage(index=True, deep=True)
Index    80
a        59
dtype: int64
>>> df.to_json()
'{"a":{"0":"11"}}'
>>> df.memory_usage(index=True, deep=True)
Index    80
a        71
dtype: int64
>>> df = pd.DataFrame({'a': [str(111)]})
>>> df.memory_usage(index=True, deep=True)
Index    80
a        60
dtype: int64
>>> df.to_json()
'{"a":{"0":"111"}}'
>>> df.memory_usage(index=True, deep=True)
Index    80
a        76
dtype: int64
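
For what it's worth, the per-string growth above is exactly 4 * (len + 1) bytes, which is consistent with each string gaining a cached 4-byte wide-character copy of itself (len + 1 code points, terminator included), of the kind the older unicode C API calls attach to the string object. A quick check of the arithmetic, under that assumption:

# growth in reported bytes after to_json(), per string, from the session above
growth = {'1': 66 - 58, '11': 71 - 59, '111': 76 - 60}

# hypothesis: each string gains a cached 4-byte copy of len + 1 code points
for s, g in growth.items():
    assert g == 4 * (len(s) + 1)
print('growth matches 4 * (len + 1) bytes per string')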

tobgu added commits to tobgu/pandas that referenced this issue Feb 9, 2017
jreback modified the milestones: 0.20.0, Someday Feb 10, 2017
AnkurDedania pushed a commit to AnkurDedania/pandas that referenced this issue Mar 21, 2017
Make use of the PEP 393 API to avoid expanding single byte ascii
characters into four byte unicode characters when encoding objects to
json.

closes pandas-dev#15344

Author: Tobias Gustafsson <[email protected]>

Closes pandas-dev#15360 from tobgu/backport-ujson-compact-ascii-encoding and squashes the following commits:

44de133 [Tobias Gustafsson] Fix C-code formatting to pass linting of GH15344
b7e404f [Tobias Gustafsson] Merge branch 'master' into backport-ujson-compact-ascii-encoding
4e8e2ff [Tobias Gustafsson] BUG: Fix pandas-dev#15344 by backporting ujson usage of PEP 393 APIs for compact ascii
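
A hypothetical regression check for this behaviour; the test name and structure are illustrative, not taken from the pandas test suite:

import pandas as pd

def test_to_json_leaves_memory_usage_unchanged():
    # encoding to JSON should not grow the frame's reported deep memory usage
    df = pd.DataFrame({'a': [str(i) for i in range(1000)]})
    before = df.memory_usage(index=True, deep=True).sum()
    df.to_json()
    after = df.memory_usage(index=True, deep=True).sum()
    assert before == after
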
somic commented Feb 16, 2018

I am observing what looks like a very similar memory leak with to_json on python 2.7.6 (ships with ubuntu trusty) and pandas 0.22.0. Anybody seen this?
