Calling DataFrame.to_json() increases data frame memory usage in Python 3.6 #15344
Comments
Partly guessing, but I think https://www.python.org/dev/peps/pep-0393/ could be the culprit. Python 3 can choose a compact string representation, and the C API calls the json code is using could then force it into a less compact representation. https://docs.python.org/3/whatsnew/3.3.html#pep-393-flexible-string-representation
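For a rough sense of what PEP 393's flexible representation means for per-character storage, a small sketch (added for illustration; exact overheads vary by CPython version and platform):

import sys

ascii_s = "a" * 1000                  # compact ASCII: 1 byte per character
ucs4_s = "a" * 999 + "\U0001F600"     # one non-BMP character forces 4 bytes per character

print(sys.getsizeof(ascii_s))         # roughly 1000 bytes plus a fixed object header
print(sys.getsizeof(ucs4_s))          # roughly 4000 bytes plus a fixed object header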
The absolute sizes of memory used depend on py2 vs py3, as @chris-b1 indicates. Strings are held in Python space, backed by pointers from numpy. This IS using the C API; I would hazard a guess that strings that were formerly interned are now no longer. This is pretty deep. If anyone wants to investigate, please do so, though I would say this is out of pandas' hands.
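As a general illustration of string interning (a sketch only, not a claim about what the ujson code does internally): equal strings built at runtime are normally distinct objects unless explicitly interned, so losing interning means repeated values each pay full price.

import sys

# Strings constructed at runtime are equal but usually distinct objects.
a = "col_" + str(1)
b = "col_" + str(1)
print(a == b, a is b)   # True False (typically)

# Interning makes equal strings share a single object, saving memory when
# the same value repeats many times, e.g. across a string column.
a = sys.intern(a)
b = sys.intern(b)
print(a is b)           # True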
It does look like ours - pandas/pandas/src/ujson/python/objToJSON.c, line 402 at bf1a596.
Yes, there seem to be quite a few updates in that code. Having a dataframe dominated by long strings would hence mean that the memory usage is roughly quadrupled after calling to_json() (see the sketch after the session below):
>>> df = pd.DataFrame({'a': [str(1)]})
>>> df.memory_usage(index=True, deep=True)
Index 80
a 58
dtype: int64
>>> df.to_json()
'{"a":{"0":"1"}}'
>>> df.memory_usage(index=True, deep=True)
Index 80
a 66
dtype: int64
>>> df = pd.DataFrame({'a': [str(11)]})
>>> df.to_json()
'{"a":{"0":"11"}}'
>>> df = pd.DataFrame({'a': [str(11)]})
>>> df.memory_usage(index=True, deep=True)
Index 80
a 59
dtype: int64
>>> df.to_json()
'{"a":{"0":"11"}}'
>>> df.memory_usage(index=True, deep=True)
Index 80
a 71
dtype: int64
>>> df = pd.DataFrame({'a': [str(111)]})
>>> df.memory_usage(index=True, deep=True)
Index 80
a 60
dtype: int64
>>> df.to_json()
'{"a":{"0":"111"}}'
>>> df.memory_usage(index=True, deep=True)
Index 80
a 76
dtype: int64
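A sketch of the long-string case described above (added for illustration; assuming an affected build such as Python 3.6 with pandas 0.19.2; on fixed versions the reported size should not grow):

import pandas as pd

# One long ASCII string: stored compactly at 1 byte per character.
df = pd.DataFrame({'a': ['x' * 100000]})

before = df.memory_usage(index=True, deep=True).sum()
df.to_json()
after = df.memory_usage(index=True, deep=True).sum()

# On affected versions the string object also gains a 4-bytes-per-character
# cache, so `after` comes out roughly four to five times `before`.
print(before, after)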
Make use of the PEP 393 API to avoid expanding single-byte ASCII characters into four-byte unicode characters when encoding objects to json.

closes pandas-dev#15344

Author: Tobias Gustafsson <[email protected]>

Closes pandas-dev#15360 from tobgu/backport-ujson-compact-ascii-encoding and squashes the following commits:

44de133 [Tobias Gustafsson] Fix C-code formatting to pass linting of GH15344
b7e404f [Tobias Gustafsson] Merge branch 'master' into backport-ujson-compact-ascii-encoding
4e8e2ff [Tobias Gustafsson] BUG: Fix pandas-dev#15344 by backporting ujson usage of PEP 393 APIs for compact ascii
I am observing what looks like a very similar memory leak with to_json on Python 2.7.6 (ships with Ubuntu Trusty) and pandas 0.22.0. Has anybody seen this?
Code Sample, a copy-pastable example if possible
Compared to Python 2.7.12
Problem description
Calling to_json should not have any impact on the reported memory usage of a DataFrame, just like in Python 2. The observed increase above is 32%, which is really high. This only seems to happen with dataframes that have strings in them.

I've also tested calling to_csv; that does not trigger this behaviour. Furthermore, it seems like the memory usage is quite a lot higher in Python 3 compared to the equivalent data frame in Python 2 (~25% in the example above). I guess this is more related to strings in Python 2 vs Python 3 than pandas, though?
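A minimal, copy-pastable sketch of the comparison described above (hypothetical data; assuming Python 3.6 with pandas 0.19.2, so exact byte counts and the size of the increase will vary):

import pandas as pd

df = pd.DataFrame({'a': ['row %d' % i for i in range(1000)]})

base = df.memory_usage(index=True, deep=True).sum()

df.to_csv()    # returns the CSV as a string; does not trigger the growth
after_csv = df.memory_usage(index=True, deep=True).sum()

df.to_json()   # returns the JSON as a string; reported size grows on affected versions
after_json = df.memory_usage(index=True, deep=True).sum()

print(base, after_csv, after_json)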
Expected Output
No change in reported memory usage after calling to_json.

Output of pd.show_versions()
pandas: 0.19.2
nose: None
pip: 9.0.1
setuptools: 34.1.0
Cython: None
numpy: 1.12.0
scipy: None
statsmodels: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: 2.6.2
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
boto: None
pandas_datareader: None