Memory leak in df.to_json #24889

Closed
skatenerd opened this issue Jan 23, 2019 · 16 comments · Fixed by #26239
Labels: IO JSON (read_json, to_json, json_normalize) · Performance (memory or execution speed)
Milestone: Contributions Welcome

Comments

@skatenerd

Code Sample, a copy-pastable example if possible

import pandas as pd                                                                                                                              
import numpy as np


df = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))

while True:
    body = df.T.to_json()
    print("HI")

Problem description

If we repeatedly call to_json() on a dataframe, memory usage grows continuously:

[screenshot: memory usage plot showing continuous growth over time]

Expected Output

I would expect memory usage to stay constant.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.8.final.0
python-bits: 64
OS: Linux
OS-release: 4.15.0-43-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: C.UTF-8
LANG: C.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: None
pip: 18.1
setuptools: 40.6.3
Cython: 0.25.2
numpy: 1.15.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.1.1
sphinx: None
patsy: 0.5.1
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.1
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: 2.7.5 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@TomAugspurger
Contributor

The source is at https://github.com/pandas-dev/pandas/blob/master/pandas/_libs/src/ujson/python/objToJSON.c if you're interested in debugging further.

@chris-b1
Contributor

FWIW this seems to take a ton of iterations and doesn't really leak much memory, but investigations are welcome.
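
To put rough numbers on that, here is a minimal measurement sketch (assuming psutil is installed; the iteration and reporting counts are arbitrary) that prints the process RSS every 100,000 calls:

import psutil  # third-party; pip install psutil
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))
proc = psutil.Process()  # the current process

for i in range(1_000_000):
    df.to_json()
    if i % 100_000 == 0:
        # resident set size in MiB; on affected versions this creeps upward
        print(f"iteration {i}: RSS = {proc.memory_info().rss / 2**20:.1f} MiB")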

@chris-b1 chris-b1 added Performance Memory or execution speed performance IO JSON read_json, to_json, json_normalize labels Jan 24, 2019
@chris-b1 chris-b1 added this to the Contributions Welcome milestone Jan 24, 2019
@TomAugspurger
Contributor

TomAugspurger commented Jan 24, 2019 via email

@skatenerd
Author

@TomAugspurger sorry, that was left over from some other original code. The leak happens with or without the transpose.

@jbrockmendel
Member

Does it also happen with other dtypes? How about an empty DataFrame?

@skatenerd
Author

Memory use is stable for an empty DataFrame.

@skatenerd
Author

skatenerd commented Jan 25, 2019

In simple cases, there's an easy workaround: use df.to_dict() and pass the result to Python's json.dumps. You'll have to make some manual adjustments, such as serializing pandas timestamps yourself.
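
A rough sketch of that workaround (assuming default=str is an acceptable fallback for timestamps; the output formatting will differ slightly from to_json):

import json
import pandas as pd

df = pd.DataFrame({'ts': pd.date_range('2019-01-01', periods=3), 'x': [1, 2, 3]})

# to_dict() sidesteps the leaking C serializer; default=str converts
# pd.Timestamp (and anything else json can't handle natively) via str()
body = json.dumps(df.to_dict(), default=str)
print(body)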

@WillAyd
Member

WillAyd commented Apr 27, 2019

I looked at this in more detail while reviewing the JSON integration. As mentioned above, this issue does not affect empty DataFrames, nor does it affect non-numeric frames.

This leaks:

pd.DataFrame([[1]]).to_json()

But this doesn't:

pd.DataFrame([['a']]).to_json()

I believe I've isolated the issue to one line in objToJSON.c: [embedded source permalink not preserved]

Removing that prevented the example from the OP from leaking. Unfortunately, it caused segfaults elsewhere in the code base, so I have to debug further, but I will push a PR if I can figure it out.

@jbrockmendel you might have some thoughts here as well. You called this condition out as looking wonky before, and it does seem to cause the leak here; I just need to figure out what the actual intent is.
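
For anyone who wants to check the numeric/non-numeric split themselves, a quick comparison sketch (Unix-only, using the stdlib resource module; exact numbers will vary by platform and version):

import resource
import pandas as pd

def rss_growth_mib(df, n=200_000):
    # ru_maxrss is the RSS high-water mark (KiB on Linux), so measure
    # the non-leaking case first to avoid masking its reading
    before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    for _ in range(n):
        df.to_json()
    after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return (after - before) / 1024

print('string: ', rss_growth_mib(pd.DataFrame([['a']])))  # roughly flat
print('numeric:', rss_growth_mib(pd.DataFrame([[1]])))    # grows on affected versions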

@LuisLuii

LuisLuii commented Oct 6, 2020

Hi, I am facing the same memory-leak issue with df.to_json().

When I use df.to_dict() and pass the result to Python's json.dumps, memory use is stable:

[screenshot: flat memory usage with the to_dict workaround]

But when I use df.to_json(), memory grows:

[screenshot: climbing memory usage with to_json]

Code Sample

import json

import pandas as pd


def list_to_df_json(data):
    # group point values by point_id, keyed by timestamp
    point_classified = {}
    for i in data:
        if i['point_id'] not in point_classified:
            point_classified[i['point_id']] = {}
        point_classified[i['point_id']][i['timestamp']] = i['point_value']
    return point_classified


def boo(a):
    data = list_to_df_json(a)

    for point_id, point_value_of_that_id in data.items():
        # logging.info(f"pushing data from pointid : {point_id} ")
        df = pd.DataFrame.from_dict(point_value_of_that_id, orient='index', columns=[point_id])

        # workaround: no leak
        # dict_df = df.to_dict(orient='index')
        # json_df = json.dumps(dict_df)

        # memory leak
        json_df = df.to_json(orient='index')
    return json_df


while 1:
    a = [{'point_id': 'a', 'point_value': 346.9, 'timestamp': '2019-12-01 08:15:00'},
         {'point_id': 'a', 'point_value': 247.2, 'timestamp': '2019-12-01 08:30:00'},
         {'point_id': 'a', 'point_value': 237.9, 'timestamp': '2019-12-01 08:45:00'},
         {'point_id': 'a', 'point_value': 215.2, 'timestamp': '2019-12-01 09:00:00'},
         {'point_id': 'b', 'point_value': 276.8, 'timestamp': '2019-12-01 09:15:00'},
         {'point_id': 'b', 'point_value': 296.1, 'timestamp': '2019-12-01 09:30:00'},
         {'point_id': 'b', 'point_value': 328.0, 'timestamp': '2019-12-01 09:45:00'}]

    print(boo(a))
    # pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.10.final.0
python-bits: 64
OS: Darwin
OS-release: 19.0.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: None
LOCALE: en_GB.UTF-8

pandas: 0.24.2
pytest: 5.4.3
pip: 19.3.1
setuptools: 44.0.0.post20200106
Cython: None
numpy: 1.19.1
scipy: 1.5.2
pyarrow: None
xarray: None
IPython: None
sphinx: None
patsy: None
dateutil: 2.8.1
pytz: 2019.3
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.3.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: 4.5.0
bs4: 4.8.2
html5lib: None
sqlalchemy: 1.3.18
pymysql: 0.9.3
psycopg2: 2.7.7 (dt dec pq3 ext lo64)
jinja2: 2.11.2
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@ohinds

ohinds commented Nov 17, 2020

I'm still seeing this memory leak in to_json() as of pandas 1.1.4.

A slightly modified version of the original reproduction script:

import pandas as pd
import numpy as np
import memory_utils  # local helper that prints the current process memory usage (stand-in sketch below)

df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))

while True:
    # leak
    body = df.to_json()

    # no leak
    # body = df.to_dict()

    memory_utils.print_memory('to_json leak test')
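
memory_utils above is the commenter's local helper, not a published package; a minimal stand-in, assuming psutil is available, could look like this:

# memory_utils.py -- hypothetical stand-in for the local helper used above
import psutil

def print_memory(tag):
    # print the current process resident set size, labelled with a tag
    rss_mib = psutil.Process().memory_info().rss / 2**20
    print(f'{tag}: RSS = {rss_mib:.1f} MiB')
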
INSTALLED VERSIONS
------------------
commit           : 67a3d4241ab84419856b84fc3ebc9abcbe66c6b3
python           : 3.8.6.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.9.8-arch1-1
Version          : #1 SMP PREEMPT Tue, 10 Nov 2020 22:44:11 +0000
machine          : x86_64
processor        :
byteorder        : little
LC_ALL           : None
LANG             : en_US.UTF-8
LOCALE           : en_US.UTF-8

pandas           : 1.1.4
numpy            : 1.19.4
pytz             : 2020.4
dateutil         : 2.8.1
pip              : 20.2.4
setuptools       : 50.3.2
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : None
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : None
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pytables         : None
pyxlsb           : None
s3fs             : None
scipy            : None
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : None

@mikicz

mikicz commented Mar 15, 2021

I'm also seeing this, using a modified version of @ohinds's code:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))

for i in range(1000000):
    # leak
    # body = df.to_json()

    # no leak
    body = df.to_dict()

With to_json():

[screenshot: memory usage climbing steadily over iterations]

With to_dict():

[screenshot: flat memory usage]

INSTALLED VERSIONS
------------------
commit           : f2c8480af2f25efdbd803218b9d87980f416563e
python           : 3.9.2.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.10.21-200.fc33.x86_64
Version          : #1 SMP Mon Mar 8 00:24:40 UTC 2021
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_GB.UTF-8
LOCALE           : en_GB.UTF-8

pandas           : 1.2.3
numpy            : 1.20.1
pytz             : 2021.1
dateutil         : 2.8.1
pip              : 20.2.2
setuptools       : 49.1.3
Cython           : 0.29.22
pytest           : 6.2.2
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : 4.6.2
html5lib         : 1.1
pymysql          : None
psycopg2         : 2.8.6 (dt dec pq3 ext lo64)
jinja2           : 2.11.3
IPython          : 7.21.0
pandas_datareader: None
bs4              : None
bottleneck       : None
fsspec           : None
fastparquet      : None
gcsfs            : None
matplotlib       : 3.3.4
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pyxlsb           : None
s3fs             : None
scipy            : 1.6.1
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
numba            : None

@klichukb

Can confirm this as well on pandas 1.2.3 with numeric data: dumping the DataFrame is fine, but dumping a Series of numerics leaks.
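
A minimal sketch of the Series variant being described (watch the process RSS externally, e.g. with top):

import pandas as pd

s = pd.Series([1.0, 2.0, 3.0])

while True:
    # on pandas 1.2.3 this reportedly leaks, while the DataFrame
    # equivalent did not for this commenter
    s.to_json()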

@fcs69

fcs69 commented Nov 4, 2021

Can confirm the same behaviour with Python 3.8.7 / pandas 1.3.4 on Windows 10 (64-bit).

@arturzangiev

Still a problem on macOS, Python 3.8, pandas 1.0.5.

@saeedghadiri

Still a problem on Windows, Python 3.9 / pandas 1.2.3.

@saeedghadiri

Upgrading to pandas 1.4.3 fixes the problem.
