Memory leak in df.to_json
#24889
Comments
Source is at https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/src/ujson/python/objToJSON.c if you're interested in debugging further. |
FWIW this seems to take a ton of iterations and doesn't really leak much memory, but investigations are welcome. |
It's also worth isolating the to_json part from the `df.T` part.
|
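A minimal sketch of what that isolation could look like, watching the process's memory in an external tool (e.g. `top`) while each loop runs; the shapes and iteration counts are illustrative, not from the thread:

```python
# Hypothetical isolation test: exercise each piece of the original loop
# separately to see which part is responsible for the memory growth.
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(100, 100))

for _ in range(100_000):
    df.T              # transpose only
for _ in range(100_000):
    df.to_json()      # serialization only, no transpose
for _ in range(100_000):
    df.T.to_json()    # both combined, as in the original report
```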
@TomAugspurger sorry, that was left over from some other original code. The leak happens with or without the transpose. |
Does it also happen with other dtypes? How about an empty DataFrame? |
Memory use is stable for an empty dataframe |
In simple cases, there's an easy workaround. You can just use |
Looked at this in more detail as I was reviewing the JSON integration. As mentioned above, this issue does not affect empty dataframes, nor does it affect non-numeric frames. This leaks: `pd.DataFrame([[1]]).to_json()`, but this doesn't: `pd.DataFrame([['a']]).to_json()`. I believe I've isolated the issue to this line of code:
Removing that prevented the example from the OP from leaking. Unfortunately it caused segfaults in the rest of the code base, so I'll have to debug that further, but I will push a PR if I can figure it out. @jbrockmendel you might have some thoughts here as well. You called this condition out as looking wonky before, and it does seem to cause the leak here; I just need to figure out what the actual intent is. |
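A minimal sketch of how the numeric-vs-string difference above can be observed, assuming `psutil` is installed for RSS measurement (hypothetical helper, not code from the thread):

```python
# Hypothetical check: compare resident-set-size growth when repeatedly
# serializing a numeric frame vs. a string frame. Requires psutil.
import pandas as pd
import psutil


def rss_growth_mb(df, iterations=100_000):
    proc = psutil.Process()
    before = proc.memory_info().rss
    for _ in range(iterations):
        df.to_json()
    return (proc.memory_info().rss - before) / 1e6


print("numeric:", rss_growth_mb(pd.DataFrame([[1]])))    # grows on affected versions
print("string:", rss_growth_mb(pd.DataFrame([["a"]])))   # stays roughly flat
```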
I'm still seeing this memory leak. A slightly modified version of the original reproduction script:
|
I'm also seeing this, using a modified version of @ohinds's code:
With running
|
Can confirm this as well in pandas 1.2.3 with a numeric column. Dumping a dataframe is fine; dumping a series of numerics leaks. |
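A hypothetical illustration of that distinction (not the commenter's code): the same numeric values serialized as a DataFrame and as a Series.

```python
# Hypothetical illustration: per the report above, on pandas 1.2.3 the
# DataFrame path was stable while the Series path leaked.
import pandas as pd

df = pd.DataFrame({"x": range(1_000)})
s = df["x"]

for _ in range(100_000):
    df.to_json()   # reported as fine
for _ in range(100_000):
    s.to_json()    # reported to leak
```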
Can confirm the same behaviour with Python 3.8.7 / pandas 1.3.4 on Windows 10 (64-bit). |
Still a problem Mac OS, python 3.8, pandas 1.0.5 |
Still a problem on Windows, python 3.9 / pandas 1.2.3 |
Upgrading to pandas 1.4.3 fixes the problem. |
Code Sample, a copy-pastable example if possible
Problem description
If we repeatedly call `to_json()` on a dataframe, memory usage grows continuously.

Expected Output

I would expect memory usage to stay constant.
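The original copy-pastable example was not preserved in this capture; below is a minimal sketch of what such a reproduction could look like on Linux, with illustrative shapes and iteration counts (`resource.getrusage` reports `ru_maxrss` in kilobytes there):

```python
# Hypothetical reproduction sketch: repeatedly serialize a numeric frame and
# print the peak resident set size as the loop progresses.
import resource

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(100, 100))

for i in range(200_000):
    df.to_json()
    if i % 10_000 == 0:
        print(i, resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
```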
Output of pd.show_versions()

/usr/local/lib/python3.6/dist-packages/psycopg2/__init__.py:144: UserWarning: The psycopg2 wheel package will be renamed from release 2.8; in order to keep installing from binary please use "pip install psycopg2-binary" instead. For details see: http://initd.org/psycopg/docs/install.html#binary-install-from-pypi.
""")
INSTALLED VERSIONS
commit: None
python: 3.6.8.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-43-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.23.4
pytest: None
pip: 18.1
setuptools: 40.6.3
Cython: 0.25.2
numpy: 1.15.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.1.1
sphinx: None
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: 2.7.5 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None